Preface to the Second Edition


Our aim in writing the original edition of Numerical Recipes was to provide a 
book that combined general discussion, analytical mathematics, algorithmics, and 
actual working programs. The success of the first edition puts us now in a difficult, 
though hardly unenviable, position. We wanted, then and now, to write a book 
that is informal, fearlessly editorial, unesoteric, and above all useful. There is a 
danger that, if we are not careful, we might produce a second edition that is weighty, 
balanced, scholarly, and boring. 

It is a mixed blessing that we know more now than we did six years ago. Then, 
we were making educated guesses, based on existing literature and our own research, 
about which numerical techniques were the most important and robust. Now, we have 
the benefit of direct feedback from a large reader community. Letters to our alter-ego 
enterprise, Numerical Recipes Software, are in the thousands per year. (Please, don’t 
telephone us.) Our post office box has become a magnet for letters pointing out 
that we have omitted some particular technique, well known to be important in a 
particular field of science or engineering. We value such letters, and digest them 
carefully, especially when they point us to specific references in the literature. 

The inevitable result of this input is that this Second Edition of Numerical 
Recipes is substantially larger than its predecessor, in fact about 50% larger both in 
words and number of included programs (the latter now numbering well over 300). 
“Don’t let the book grow in size,” is the advice that we received from several wise 
colleagues. We have tried to follow the intended spirit of that advice, even as we 
violate the letter of it. We have not lengthened, or increased in difficulty, the book’s 
principal discussions of mainstream topics. Many new topics are presented at this 
same accessible level. Some topics, both from the earlier edition and new to this 
one, are now set in smaller type that labels them as being “advanced.” The reader 
who ignores such advanced sections completely will not, we think, find any lack of 
continuity in the shorter volume that results. 

Here are some highlights of the new material in this Second Edition: 

• a new chapter on integral equations and inverse methods 

• a detailed treatment of multigrid methods for solving elliptic partial 
differential equations 

• routines for band diagonal linear systems 

• improved routines for linear algebra on sparse matrices 

• Cholesky and QR decomposition 

• orthogonal polynomials and Gaussian quadratures for arbitrary weight 
functions 

• methods for calculating numerical derivatives 

• Padé approximants, and rational Chebyshev approximation

• Bessel functions, and modified Bessel functions, of fractional order; and 
several other new special functions 

• improved random number routines 

• quasi-random sequences 

• routines for adaptive and recursive Monte Carlo integration in
high-dimensional spaces

• globally convergent methods for sets of nonlinear equations

• simulated annealing minimization for continuous control spaces

• fast Fourier transform (FFT) for real data in two and three dimensions 

• fast Fourier transform (FFT) using external storage 

• improved fast cosine transform routines 

• wavelet transforms 

• Fourier integrals with upper and lower limits 

• spectral analysis on unevenly sampled data 

• Savitzky-Golay smoothing filters 

• fitting straight line data with errors in both coordinates 

• a two-dimensional Kolmogorov-Smirnov test

• the statistical bootstrap method 

• embedded Runge-Kutta-Fehlberg methods for differential equations 

• high-order methods for stiff differential equations 

• a new chapter on “less-numerical” algorithms, including Huffman and 
arithmetic coding, arbitrary precision arithmetic, and several other topics. 

Consult the Preface to the First Edition, following, or the Table of Contents, for a
list of the more “basic” subjects treated.

compilers - too numerous (and sometimes too buggy) for individual acknowledg¬

in this book) has uncovered compiler bugs in many of the compilers tried. When
possible, we work with developers to see that such bugs get fixed; we encourage
interested compiler developers to contact us about such arrangements.

Preface to the First Edition


We call this book Numerical Recipes for several reasons. In one sense, this book 
is indeed a “cookbook” on numerical computation. However, there is an important
distinction between a cookbook and a restaurant menu. The latter presents choices 
among complete dishes in each of which the individual flavors are blended and 
disguised. The former — and this book — reveals the individual ingredients and 
explains how they are prepared and combined. 

Another purpose of the title is to connote an eclectic mixture of presentational 
techniques. This book is unique, we think, in offering, for each topic considered, 
a certain amount of general discussion, a certain amount of analytical mathematics, 
a certain amount of discussion of algorithmics, and (most important) actual
implementations of these ideas in the form of working computer routines. Our task has
been to find the right balance among these ingredients for each topic. You will
find that for some topics we have tilted quite far to the analytic side; this is where we
have felt there to be gaps in the “standard” mathematical training. For other topics,
where the mathematical prerequisites are universally held, we have tilted towards 
more in-depth discussion of the nature of the computational algorithms, or towards 
practical questions of implementation. 

We admit, therefore, to some unevenness in the “level” of this book. About half 
of it is suitable for an advanced undergraduate course on numerical computation for 
science or engineering majors. The other half ranges from the level of a graduate 
course to that of a professional reference. Most cookbooks have, after all, recipes at 
varying levels of complexity. An attractive feature of this approach, we think, is that 
the reader can use the book at increasing levels of sophistication as his/her experience 
grows. Even inexperienced readers should be able to use our most advanced routines 
as black boxes. Having done so, we hope that these readers will subsequently go 
back and learn what secrets are inside. 

If there is a single dominant theme in this book, it is that practical methods 
of numerical computation can be simultaneously efficient, clever, and — important 
— clear. The alternative viewpoint, that efficient computational methods must 
necessarily be so arcane and complex as to be useful only in “black box” form, 
we firmly reject. 

Our purpose in this book is thus to open up a large number of computational 
black boxes to your scrutiny. We want to teach you to take apart these black boxes 
and to put them back together again, modifying them to suit your specific needs. 
We assume that you are mathematically literate, i.e., that you have the normal 
mathematical preparation associated with an undergraduate degree in a physical 
science, or engineering, or economics, or a quantitative social science. We assume 
that you know how to program a computer. We do not assume that you have any 
prior formal knowledge of numerical analysis or numerical methods. 

The scope of Numerical Recipes is supposed to be “everything up to, but 
not including, partial differential equations.” We honor this in the breach: First, 
we do have one introductory chapter on methods for partial differential equations 
(Chapter 19). Second, we obviously cannot include everything else. All the so-called
“standard” topics of a numerical analysis course have been included in this book:
linear equations (Chapter 2), interpolation and extrapolation (Chapter 3), integration
(Chapter 4), nonlinear root-finding (Chapter 9), eigensystems (Chapter 11), and
ordinary differential equations (Chapter 16). Most of these topics have been taken 
beyond their standard treatments into some advanced material which we have felt 
to be particularly important or useful. 

Some other subjects that we cover in detail are not usually found in the standard 
numerical analysis texts. These include the evaluation of functions and of particular 
special functions of higher mathematics (Chapters 5 and 6); random numbers and 
Monte Carlo methods (Chapter 7); sorting (Chapter 8); optimization, including 
multidimensional methods (Chapter 10); Fourier transform methods, including FFT 
methods and other spectral methods (Chapters 12 and 13); two chapters on the 
statistical description and modeling of data (Chapters 14 and 15); and two-point 
boundary value problems, both shooting and relaxation methods (Chapter 17). 

The programs in this book are included in ANSI-standard C. Versions of the 
book in FORTRAN, Pascal, and BASIC are available separately. We have more 
to say about the C language, and the computational environment assumed by our 
routines, in §1.1 (Introduction). 

Acknowledgments 

Many colleagues have been generous in giving us the benefit of their numerical 
and computational experience, in providing us with programs, in commenting on 
the manuscript, or in general encouragement. We particularly wish to thank George 
Rybicki, Douglas Eardley, Philip Marcus, Stuart Shapiro, Paul Horowitz, Bruce 
Musicus, Irwin Shapiro, Stephen Wolfram, Henry Abarbanel, Larry Smarr, Richard 
Muller, John Bahcall, and A.G.W. Cameron. 

We also wish to acknowledge two individuals whom we have never met: Forman 
Acton, whose 1970 textbook Numerical Methods that Work (New York: Harper and 
Row) has surely left its stylistic mark on us; and Donald Knuth, both for his series 
of books on The Art of Computer Programming (Reading, MA: Addison-Wesley), 
and for TeX, the computer typesetting language which immensely aided production
of this book. 

Research by the authors on computational methods was supported in part by 
the U.S. National Science Foundation. 

October, 1985 William H. Press 

Brian P. Flannery 
Saul A. Teukolsky 
William T. Vetterling

License Information


Read this section if you want to use the programs in this book on a computer. 
You’ll need to read the following Disclaimer of Warranty, get the programs onto your 
computer, and acquire a Numerical Recipes software license. (Without this license, 
which can be the free “immediate license” under terms described below, the book is 
intended as a text and reference book, for reading purposes only.) 

Disclaimer of Warranty 

We make no warranties, express or implied, that the programs contained 
in this volume are free of error, or are consistent with any particular standard 
of merchantability, or that they will meet your requirements for any particular 
application. They should not be relied on for solving a problem whose incorrect 
solution could result in injury to a person or loss of property. If you do use the 
programs in such a manner, it is at your own risk. The authors and publisher 
disclaim all liability for direct or consequential damages resulting from your 
use of the programs. 

How to Get the Code onto Your Computer 

Pick one of the following methods: 

• You can type the programs from this book directly into your computer. In this 
case, the only kind of license available to you is the free “immediate license” 
(see below). You are not authorized to transfer or distribute a machine-readable 
copy to any other person, nor to have any other person type the programs into a 
computer on your behalf. We do not want to hear bug reports from you if you 
choose this option, because experience has shown that virtually all reported bugs 
in such cases are typing errors! 

• You can download the Numerical Recipes programs electronically from the 
Numerical Recipes On-Line Software Store, located at http://www.nr.com, our
Web site. All the files (Recipes and demonstration programs) are packaged as 
a single compressed file. You’ll need to purchase a license to download and 
unpack them. Any number of single-screen licenses can be purchased instantly 
(with discount for multiple screens) from the On-Line Store, with fees that depend 
on your operating system (Windows or Macintosh versus Linux or UNIX) and 
whether you are affiliated with an educational institution. Purchasing a
single-screen license is also the way to start if you want to acquire a more general (site
or corporate) license; your single-screen cost will be subtracted from the cost of 
any later license upgrade. 

• You can purchase media containing the programs from Cambridge University Press. 

A CD-ROM version in ISO-9660 format for Windows and Macintosh systems 
contains the complete C software, and also the C++ version. More extensive
CD-ROMs in ISO-9660 format for Windows, Macintosh, and UNIX/Linux systems are
also available; these include the C, C++, and Fortran versions on a single CD-ROM 
(as well as versions in Pascal and BASIC from the first edition). These CD-ROMs 
are available with a single-screen license for Windows or Macintosh (order ISBN 
0 521 750350), or (at a slightly higher price) with a single-screen license for 
UNIX/Linux workstations (order ISBN 0 521 750369). Orders for media from 
Cambridge University Press can be placed at 800 872-7423 (North America only) 
or by email to orders@cup.org (North America) or directcustserv@cambridge.org 
(rest of world). Or, visit the Web site http://www.cambridge.org.

Types of License Offered

Here are the types of licenses that we offer. Note that some types are 
automatically acquired with the purchase of media from Cambridge University 
Press, or of an unlocking password from the Numerical Recipes On-Line Software 
Store, while other types of licenses require that you communicate specifically with 
Numerical Recipes Software (email: orders@nr.com or fax: 781 863-1739). Our 
Web site http://www.nr.com has additional information.

• [“Immediate License”] If you are the individual owner of a copy of this book and 
you type one or more of its routines into your computer, we authorize you to use 
them on that computer for your own personal and noncommercial purposes. You 
are not authorized to transfer or distribute machine-readable copies to any other 
person, or to use the routines on more than one machine, or to distribute executable 
programs containing our routines. This is the only free license. 

• [“Single-Screen License”] This is the most common type of low-cost license, with 
terms governed by our Single Screen (Shrinkwrap) License document (complete 
terms available through our Web site). Basically, this license lets you use Numerical 
Recipes routines on any one screen (PC, workstation, X-terminal, etc.). You may 
also, under this license, transfer pre-compiled, executable programs incorporating 
our routines to other, unlicensed, screens or computers, providing that (i) your 
application is noncommercial (i.e., does not involve the selling of your program 
for a fee), (ii) the programs were first developed, compiled, and successfully run 
on a licensed screen, and (iii) our routines are bound into the programs in such a 
manner that they cannot be accessed as individual routines and cannot practicably 
be unbound and used in other programs. That is, under this license, your program 
user must not be able to use our programs as part of a program library or
“mix-and-match” workbench. Conditions for other types of commercial or noncommercial
distribution may be found on our Web site (http://www.nr.com).

• [“Multi-Screen, Server, Site, and Corporate Licenses”] The terms of the Single 
Screen License can be extended to designated groups of machines, defined by 
number of screens, number of machines, locations, or ownership. Significant 
discounts from the corresponding single-screen prices are available when the 
estimated number of screens exceeds 40. Contact Numerical Recipes Software 
(email: orders@nr.com or fax: 781 863-1739) for details. 

• [“Course Right-to-Copy License”] Instructors at accredited educational institutions 
who have adopted this book for a course, and who have already purchased a Single 
Screen License (either acquired with the purchase of media, or from the Numerical 
Recipes On-Line Software Store), may license the programs for use in that course 
as follows: Mail your name, title, and address; the course name, number, dates, 
and estimated enrollment; and advance payment of $5 per (estimated) student 
to Numerical Recipes Software, at this address: P.O. Box 243, Cambridge, MA
02238 (USA). You will receive by return mail a license authorizing you to make 
copies of the programs for use by your students, and/or to transfer the programs to 
a machine accessible to your students (but only for the duration of the course). 

About Copyrights on Computer Programs 

Like artistic or literary compositions, computer programs are protected by 
copyright. Generally it is an infringement for you to copy into your computer a 
program from a copyrighted source. (It is also not a friendly thing to do, since it 
deprives the program’s author of compensation for his or her creative effort.) Under
copyright law, all “derivative works” (modified versions, or translations into another
computer language) also come under the same copyright as the original work. 

Copyright does not protect ideas, but only the expression of those ideas in 
a particular form. In the case of a computer program, the ideas consist of the 
program’s methodology and algorithm, including the necessary sequence of steps 
adopted by the programmer. The expression of those ideas is the program source 
code (particularly any arbitrary or stylistic choices embodied in it), its derived object 
code, and any other derivative works. 

If you analyze the ideas contained in a program, and then express those 
ideas in your own completely different implementation, then that new program 
implementation belongs to you. That is what we have done for those programs in 
this book that are not entirely of our own devising. When programs in this book are 
said to be “based” on programs published in copyright sources, we mean that the 
ideas are the same. The expression of these ideas as source code is our own. We 
believe that no material in this book infringes on an existing copyright. 

Trademarks 

Several registered trademarks appear within the text of this book: Sun is a 
trademark of Sun Microsystems, Inc. SPARC and SPARCstation are trademarks 
of SPARC International, Inc. Microsoft, Windows 95, Windows NT, PowerStation, 
and MS are trademarks of Microsoft Corporation. DEC, VMS, Alpha AXP, and 
ULTRIX are trademarks of Digital Equipment Corporation. IBM is a trademark of 
International Business Machines Corporation. Apple and Macintosh are trademarks 
of Apple Computer, Inc. UNIX is a trademark licensed exclusively through X/Open 
Co. Ltd. IMSL is a trademark of Visual Numerics, Inc. NAG refers to proprietary 
computer software of Numerical Algorithms Group (USA) Inc. PostScript and 
Adobe Illustrator are trademarks of Adobe Systems Incorporated. Last, and no doubt 
least, Numerical Recipes (when identifying products) is a trademark of Numerical 
Recipes Software. 

Attributions 

The fact that ideas are legally “free as air” in no way supersedes the ethical 
requirement that ideas be credited to their known originators. When programs in 
this book are based on known sources, whether copyrighted or in the public domain, 
published or “handed-down,” we have attempted to give proper attribution.
Unfortunately, the lineage of many programs in common circulation is often unclear. We
would be grateful to readers for new or corrected information regarding attributions,
which we will attempt to incorporate in subsequent printings.

1.0 flmoon calculate phases of the moon by date

1.1 julday Julian Day number from calendar date

1.1 badluk Friday the 13th when the moon is full

1.1 caldat calendar date from Julian day number

2.1 gaussj Gauss-Jordan matrix inversion and linear equation solution

2.3 ludcmp linear equation solution, LU decomposition

2.3 lubksb linear equation solution, backsubstitution

2.4 tridag solution of tridiagonal systems

2.4 banmul multiply vector by band diagonal matrix

2.4 bandec band diagonal systems, decomposition

2.4 banbks band diagonal systems, backsubstitution

2.5 mprove linear equation solution, iterative improvement

2.6 svbksb singular value backsubstitution

2.6 svdcmp singular value decomposition of a matrix

2.6 pythag calculate $(a^2 + b^2)^{1/2}$ without overflow

2.7 cyclic solution of cyclic tridiagonal systems

2.7 sprsin convert matrix to sparse format

2.7 sprsax product of sparse matrix and vector

2.7 sprstx product of transpose sparse matrix and vector

2.7 sprstp transpose of sparse matrix

2.7 sprspm pattern multiply two sparse matrices

2.7 sprstm threshold multiply two sparse matrices

2.7 linbcg biconjugate gradient solution of sparse systems

2.7 snrm used by linbcg for vector norm

2.7 atimes used by linbcg for sparse multiplication

2.7 asolve used by linbcg for preconditioner

2.8 vander solve Vandermonde systems

2.8 toeplz solve Toeplitz systems

2.9 choldc Cholesky decomposition

2.9 cholsl Cholesky backsubstitution

2.10 qrdcmp QR decomposition

2.10 qrsolv QR backsubstitution

2.10 rsolv right triangular backsubstitution

2.10 qrupdt update a QR decomposition

2.10 rotate Jacobi rotation used by qrupdt

3.1 polint polynomial interpolation

3.2 ratint rational function interpolation

3.3 spline construct a cubic spline

3.3 splint cubic spline interpolation

3.4 locate search an ordered table by bisection

3.4 hunt search a table when calls are correlated

3.5 polcoe polynomial coefficients from table of values

3.5 polcof polynomial coefficients from table of values

3.6 polin2 two-dimensional polynomial interpolation

3.6 bcucof construct two-dimensional bicubic

3.6 bcuint two-dimensional bicubic interpolation

3.6 splie2 construct two-dimensional spline

3.6 splin2 two-dimensional spline interpolation

4.2 trapzd trapezoidal rule

4.2 qtrap integrate using trapezoidal rule

4.2 qsimp integrate using Simpson’s rule

4.3 qromb integrate using Romberg adaptive method

4.4 midpnt extended midpoint rule

4.4 qromo integrate using open Romberg adaptive method

4.4 midinf integrate a function on a semi-infinite interval

4.4 midsql integrate a function with lower square-root singularity

4.4 midsqu integrate a function with upper square-root singularity

4.4 midexp integrate a function that decreases exponentially

4.5 qgaus integrate a function by Gaussian quadratures

4.5 gauleg Gauss-Legendre weights and abscissas

4.5 gaulag Gauss-Laguerre weights and abscissas

4.5 gauher Gauss-Hermite weights and abscissas

4.5 gaujac Gauss-Jacobi weights and abscissas

4.5 gaucof quadrature weights from orthogonal polynomials

4.5 orthog construct nonclassical orthogonal polynomials

4.6 quad3d integrate a function over a three-dimensional space

5.1 eulsum sum a series by Euler-van Wijngaarden algorithm

5.3 ddpoly evaluate a polynomial and its derivatives

5.3 poldiv divide one polynomial by another

5.3 ratval evaluate a rational function

5.7 dfridr numerical derivative by Ridders’ method

5.8 chebft fit a Chebyshev polynomial to a function

5.8 chebev Chebyshev polynomial evaluation

5.9 chder derivative of a function already Chebyshev fitted

5.9 chint integrate a function already Chebyshev fitted

5.10 chebpc polynomial coefficients from a Chebyshev fit

5.10 pcshft polynomial coefficients of a shifted polynomial

5.11 pccheb inverse of chebpc; use to economize power series

5.12 pade Padé approximant from power series coefficients

5.13 ratlsq rational fit by least-squares method

6.1 gammln logarithm of gamma function

6.1 factrl factorial function

6.1 bico binomial coefficients function

6.1 factln logarithm of factorial function

6.1 beta beta function

6.2 gammp incomplete gamma function

6.2 gammq complement of incomplete gamma function

6.2 gser series used by gammp and gammq

6.2 gcf continued fraction used by gammp and gammq

6.2 erff error function

6.2 erffc complementary error function

6.2 erfcc complementary error function, concise routine

6.3 expint exponential integral $E_n$

6.3 ei exponential integral Ei

6.4 betai incomplete beta function

6.4 betacf continued fraction used by betai

6.5 bessj0 Bessel function $J_0$

6.5 bessy0 Bessel function $Y_0$

6.5 bessj1 Bessel function $J_1$

6.5 bessy1 Bessel function $Y_1$

6.5 bessy Bessel function Y of general integer order

6.5 bessj Bessel function J of general integer order

6.6 bessi0 modified Bessel function $I_0$

6.6 bessk0 modified Bessel function $K_0$

6.6 bessi1 modified Bessel function $I_1$

6.6 bessk1 modified Bessel function $K_1$

6.6 bessk modified Bessel function K of integer order

6.6 bessi modified Bessel function I of integer order

6.7 bessjy Bessel functions of fractional order

6.7 beschb Chebyshev expansion used by bessjy

6.7 bessik modified Bessel functions of fractional order

6.7 airy Airy functions

6.7 sphbes spherical Bessel functions $j_n$ and $y_n$

6.8 plgndr Legendre polynomials, associated (spherical harmonics)

6.9 frenel Fresnel integrals S(x) and C(x)

6.9 cisi cosine and sine integrals Ci and Si

6.10 dawson Dawson’s integral

6.11 rf Carlson’s elliptic integral of the first kind

6.11 rd Carlson’s elliptic integral of the second kind

6.11 rj Carlson’s elliptic integral of the third kind

6.11 rc Carlson’s degenerate elliptic integral

6.11 ellf Legendre elliptic integral of the first kind

6.11 elle Legendre elliptic integral of the second kind

6.11 ellpi Legendre elliptic integral of the third kind

6.11 sncndn Jacobian elliptic functions

6.12 hypgeo complex hypergeometric function

6.12 hypser complex hypergeometric function, series evaluation

6.12 hypdrv complex hypergeometric function, derivative of

7.1 ran0 random deviate by Park and Miller minimal standard

7.1 ran1 random deviate, minimal standard plus shuffle

7.1 ran2 random deviate by L’Ecuyer long period plus shuffle

7.1 ran3 random deviate by Knuth subtractive method 

7.2 expdev exponential random deviates 

7.2 gasdev normally distributed random deviates 

7.3 gamdev gamma-law distribution random deviates 

7.3 poidev Poisson distributed random deviates 

7.3 bnldev binomial distributed random deviates 

7.4 irbit1 random bit sequence

7.4 irbit2 random bit sequence 

7.5 psdes “pseudo-DES” hashing of 64 bits 

7.5 ran4 random deviates from DES-like hashing 

7.7 sobseq Sobol’s quasi-random sequence 

7.8 vegas adaptive multidimensional Monte Carlo integration 

7.8 rebin sample rebinning used by vegas 

7.8 miser recursive multidimensional Monte Carlo integration 

7.8 ranpt get random point, used by miser 

8.1 piksrt sort an array by straight insertion 

8.1 piksr2 sort two arrays by straight insertion 

8.1 shell sort an array by Shell’s method 

8.2 sort sort an array by quicksort method 

8.2 sort2 sort two arrays by quicksort method 

8.3 hpsort sort an array by heapsort method 

8.4 indexx construct an index for an array 

8.4 sort3 sort, use an index to sort 3 or more arrays 

8.4 rank construct a rank table for an array 

8.5 select find the Nth largest in an array

8.5 selip find the Nth largest, without altering an array

8.5 hpsel find M largest values, without altering an array 

8.6 eclass determine equivalence classes from list 

8.6 eclazz determine equivalence classes from procedure 

9.0 scrsho graph a function to search for roots 

9.1 zbrac outward search for brackets on roots 

9.1 zbrak inward search for brackets on roots 

9.1 rtbis find root of a function by bisection 

9.2 rtflsp find root of a function by false-position

9.2 rtsec find root of a function by secant method 

9.2 zriddr find root of a function by Ridders’ method 

9.3 zbrent find root of a function by Brent’s method 

9.4 rtnewt find root of a function by Newton-Raphson 

9.4 rtsafe find root of a function by Newton-Raphson and bisection 

9.5 laguer find a root of a polynomial by Laguerre’s method 

9.5 zroots roots of a polynomial by Laguerre’s method with deflation

9.5 zrhqr roots of a polynomial by eigenvalue methods

9.5 qroot complex or double root of a polynomial, Bairstow

9.6 mnewt Newton’s method for systems of equations

9.7 lnsrch search along a line, used by newt

9.7 newt globally convergent multi-dimensional Newton’s method

9.7 fdjac finite-difference Jacobian, used by newt

9.7 fmin norm of a vector function, used by newt

9.7 broydn secant method for systems of equations

10.1 mnbrak bracket the minimum of a function

10.1 golden find minimum of a function by golden section search

10.2 brent find minimum of a function by Brent’s method

10.3 dbrent find minimum of a function using derivative information

10.4 amoeba minimize in N-dimensions by downhill simplex method

10.4 amotry evaluate a trial point, used by amoeba

10.5 powell minimize in N-dimensions by Powell’s method

10.5 linmin minimum of a function along a ray in N-dimensions

10.5 f1dim function used by linmin

10.6 frprmn minimize in N-dimensions by conjugate gradient

10.6 dlinmin minimum of a function along a ray using derivatives

10.6 df1dim function used by dlinmin

10.7 dfpmin minimize in N-dimensions by variable metric method

10.8 simplx linear programming maximization of a linear function

10.8 simp1 linear programming, used by simplx

10.8 simp2 linear programming, used by simplx

10.8 simp3 linear programming, used by simplx

10.9 anneal traveling salesman problem by simulated annealing

10.9 revcst cost of a reversal, used by anneal

10.9 reverse do a reversal, used by anneal

10.9 trncst cost of a transposition, used by anneal

10.9 trnspt do a transposition, used by anneal

10.9 metrop Metropolis algorithm, used by anneal

10.9 amebsa simulated annealing in continuous spaces

10.9 amotsa evaluate a trial point, used by amebsa

11.1 jacobi eigenvalues and eigenvectors of a symmetric matrix

11.1 eigsrt eigenvectors, sorts into order by eigenvalue

11.2 tred2 Householder reduction of a real, symmetric matrix

11.3 tqli eigensolution of a symmetric tridiagonal matrix

11.5 balanc balance a nonsymmetric matrix

11.5 elmhes reduce a general matrix to Hessenberg form

11.6 hqr eigenvalues of a Hessenberg matrix

12.2 four1 fast Fourier transform (FFT) in one dimension

12.3 twofft fast Fourier transform of two real functions

12.3 realft fast Fourier transform of a single real function

12.3 sinft fast sine transform

12.3 cosft1 fast cosine transform with endpoints

12.3 

cosft2 

“staggered” fast cosine transform 
12.4 fourn fast Fourier transform in multidimensions

12.5 rlft3 FFT of real data in two or three dimensions 

12.6 fourfs FFT for huge data sets on external media

12.6 fourew rewind and permute files, used by fourfs

13.1 convlv convolution or deconvolution of data using FFT 

13.2 correl correlation or autocorrelation of data using FFT 

13.4 spctrm power spectrum estimation using FFT 

13.6 memcof evaluate maximum entropy (MEM) coefficients 

13.6 fixrts reflect roots of a polynomial into unit circle

13.6 predic linear prediction using MEM coefficients 

13.7 evlmem power spectral estimation from MEM coefficients 

13.8 period power spectrum of unevenly sampled data 

13.8 fasper power spectrum of unevenly sampled larger data sets 

13.8 spread extirpolate value into array, used by fasper 

13.9 dftcor compute endpoint corrections for Fourier integrals 

13.9 dftint high-accuracy Fourier integrals 

13.10 wt1 one-dimensional discrete wavelet transform

13.10 daub4 Daubechies 4-coefficient wavelet filter 

13.10 pwtset initialize coefficients for pwt 

13.10 pwt partial wavelet transform 

13.10 wtn multidimensional discrete wavelet transform 

14.1 moment calculate moments of a data set 

14.2 ttest Student’s t-test for difference of means 

14.2 avevar calculate mean and variance of a data set 

14.2 tutest Student’s t-test for means, case of unequal variances

14.2 tptest Student’s t-test for means, case of paired data 

14.2 ftest F-test for difference of variances 

14.3 chsone chi-square test for difference between data and model 

14.3 chstwo chi-square test for difference between two data sets 

14.3 ksone Kolmogorov-Smirnov test of data against model

14.3 kstwo Kolmogorov-Smirnov test between two data sets

14.3 probks Kolmogorov-Smirnov probability function

14.4 cntab1 contingency table analysis using chi-square

14.4 cntab2 contingency table analysis using entropy measure 

14.5 pearsn Pearson’s correlation between two data sets 

14.6 spear Spearman’s rank correlation between two data sets 

14.6 crank replaces array elements by their rank 

14.6 kendll correlation between two data sets, Kendall’s tau 

14.6 kendl2 contingency table analysis using Kendall’s tau 

14.7 ks2d1s K-S test in two dimensions, data vs. model

14.7 quadct count points by quadrants, used by ks2d1s

14.7 quadvl quadrant probabilities, used by ks2d1s

14.7 ks2d2s K-S test in two dimensions, data vs. data 

14.8 savgol Savitzky-Golay smoothing coefficients 
15.2 fit least-squares fit data to a straight line

15.3 fitexy fit data to a straight line, errors in both x and y

15.3 chixy used by fitexy to calculate a χ²

15.4 lfit general linear least-squares fit by normal equations

15.4 covsrt rearrange covariance matrix, used by lfit

15.4 svdfit linear least-squares fit by singular value decomposition

15.4 svdvar variances from singular value decomposition

15.4 fpoly fit a polynomial using lfit or svdfit

15.4 fleg fit a Legendre polynomial using lfit or svdfit

15.5 mrqmin nonlinear least-squares fit, Marquardt’s method

15.5 mrqcof used by mrqmin to evaluate coefficients

15.5 fgauss fit a sum of Gaussians using mrqmin

15.7 medfit fit data to a straight line robustly, least absolute deviation

15.7 rofunc fit data robustly, used by medfit

16.1 rk4 integrate one step of ODEs, fourth-order Runge-Kutta

16.1 rkdumb integrate ODEs by fourth-order Runge-Kutta

16.2 rkqs integrate one step of ODEs with accuracy monitoring

16.2 rkck Cash-Karp-Runge-Kutta step used by rkqs

16.2 odeint integrate ODEs with accuracy monitoring

16.3 mmid integrate ODEs by modified midpoint method

16.4 bsstep integrate ODEs, Bulirsch-Stoer step

16.4 pzextr polynomial extrapolation, used by bsstep

16.4 rzextr rational function extrapolation, used by bsstep

16.5 stoerm integrate conservative second-order ODEs

16.6 stiff integrate stiff ODEs by fourth-order Rosenbrock

16.6 jacobn sample Jacobian routine for stiff

16.6 derivs sample derivatives routine for stiff

16.6 simpr integrate stiff ODEs by semi-implicit midpoint rule

16.6 stifbs integrate stiff ODEs, Bulirsch-Stoer step

17.1 shoot solve two point boundary value problem by shooting

17.2 shootf ditto, by shooting to a fitting point

17.3 solvde two point boundary value problem, solve by relaxation

17.3 bksub backsubstitution, used by solvde

17.3 pinvs diagonalize a sub-block, used by solvde

17.3 red reduce columns of a matrix, used by solvde

17.4 sfroid spheroidal functions by method of solvde

17.4 difeq spheroidal matrix coefficients, used by sfroid

17.4 sphoot spheroidal functions by method of shoot

17.4 sphfpt spheroidal functions by method of shootf

18.1 fred2 solve linear Fredholm equations of the second kind

18.1 fredin interpolate solutions obtained with fred2

18.2 voltra linear Volterra equations of the second kind

18.3 wwghts quadrature weights for an arbitrarily singular kernel

18.3 kermom sample routine for moments of a singular kernel
18.3 quadmx sample routine for a quadrature matrix

18.3 fredex example of solving a singular Fredholm equation

19.5 sor elliptic PDE solved by successive overrelaxation method

19.6 mglin linear elliptic PDE solved by multigrid method

19.6 rstrct half-weighting restriction, used by mglin, mgfas

19.6 interp bilinear prolongation, used by mglin, mgfas

19.6 addint interpolate and add, used by mglin

19.6 slvsml solve on coarsest grid, used by mglin

19.6 relax Gauss-Seidel relaxation, used by mglin

19.6 resid calculate residual, used by mglin

19.6 copy utility used by mglin, mgfas

19.6 fill0 utility used by mglin

19.6 mgfas nonlinear elliptic PDE solved by multigrid method

19.6 relax2 Gauss-Seidel relaxation, used by mgfas

19.6 slvsm2 solve on coarsest grid, used by mgfas

19.6 lop applies nonlinear operator, used by mgfas

19.6 matadd utility used by mgfas

19.6 matsub utility used by mgfas

19.6 anorm2 utility used by mgfas

20.1 machar diagnose computer’s floating arithmetic

20.2 igray Gray code and its inverse

20.3 icrc1 cyclic redundancy checksum, used by icrc

20.3 icrc cyclic redundancy checksum

20.3 decchk decimal check digit calculation or verification

20.4 hufmak construct a Huffman code

20.4 hufapp append bits to a Huffman code, used by hufmak

20.4 hufenc use Huffman code to encode and compress a character

20.4 hufdec use Huffman code to decode and decompress a character

20.5 arcmak construct an arithmetic code

20.5 arcode encode or decode a character using arithmetic coding

20.5 arcsum add integer to byte string, used by arcode

20.6 mpops multiple precision arithmetic, simpler operations

20.6 mpmul multiple precision multiply, using FFT methods

20.6 mpinv multiple precision reciprocal

20.6 mpdiv multiple precision divide and remainder

20.6 mpsqrt multiple precision square root

20.6 mp2dfr multiple precision conversion to decimal base

20.6 mppi multiple precision example, compute many digits of π
Chapter 1. Preliminaries

1.0 Introduction 


This book, like its predecessor edition, is supposed to teach you methods of 
numerical computing that are practical, efficient, and (insofar as possible) elegant. 
We presume throughout this book that you, the reader, have particular tasks that you 
want to get done. We view our job as educating you on how to proceed. Occasionally 
we may try to reroute you briefly onto a particularly beautiful side road; but by and 
large, we will guide you along main highways that lead to practical destinations. 

Throughout this book, you will find us fearlessly editorializing, telling you 
what you should and shouldn’t do. This prescriptive tone results from a conscious 
decision on our part, and we hope that you will not find it irritating. We do not 
claim that our advice is infallible! Rather, we are reacting against a tendency, in 
the textbook literature of computation, to discuss every possible method that has 
ever been invented, without ever offering a practical judgment on relative merit. We 
do, therefore, offer you our practical judgments whenever we can. As you gain 
experience, you will form your own opinion of how reliable our advice is. 

We presume that you are able to read computer programs in C, that being 
the language of this version of Numerical Recipes (Second Edition). The book 
Numerical Recipes in FORTRAN (Second Edition) is separately available, if you 
prefer to program in that language. Earlier editions of Numerical Recipes in Pascal 
and Numerical Recipes Routines and Examples in BASIC are also available; while 
not containing the additional material of the Second Edition versions in C and 
FORTRAN, these versions are perfectly serviceable if Pascal or BASIC is your 
language of choice. 

When we include programs in the text, they look like this: 

#include <math.h>
#define RAD (3.14159265/180.0)

void flmoon(int n, int nph, long *jd, float *frac)
/* Our programs begin with an introductory comment summarizing their purpose and explaining
their calling sequence. This routine calculates the phases of the moon. Given an integer n and
a code nph for the phase desired (nph = 0 for new moon, 1 for first quarter, 2 for full, 3 for last
quarter), the routine returns the Julian Day Number jd, and the fractional part of a day frac
to be added to it, of the nth such phase since January, 1900. Greenwich Mean Time is assumed. */
{
    void nrerror(char error_text[]);
    int i;
    float am,as,c,t,t2,xtra;

    c=n+nph/4.0;                              /* This is how we comment an individual line. */
    t=c/1236.85;
    t2=t*t;
    as=359.2242+29.105356*c;                  /* You aren't really intended to understand */
    am=306.0253+385.816918*c+0.010730*t2;     /* this algorithm, but it does work! */
    *jd=2415020+28L*n+7L*nph;
    xtra=0.75933+1.53058868*c+((1.178e-4)-(1.55e-7)*t)*t2;
    if (nph == 0 || nph == 2)
        xtra += (0.1734-3.93e-4*t)*sin(RAD*as)-0.4068*sin(RAD*am);
    else if (nph == 1 || nph == 3)
        xtra += (0.1721-4.0e-4*t)*sin(RAD*as)-0.6280*sin(RAD*am);
    else nrerror("nph is unknown in flmoon"); /* This is how we will indicate error conditions. */
    i=(int)(xtra >= 0.0 ? floor(xtra) : ceil(xtra-1.0));
    *jd += i;
    *frac=xtra-i;
}

If the syntax of the function definition above looks strange to you, then you are 
probably used to the older Kernighan and Ritchie (“K&R”) syntax, rather than that of 
the newer ANSI C. In this edition, we adopt ANSI C as our standard. You might want 
to look ahead to §1.2 where ANSI C function prototypes are discussed in more detail.
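
For readers who have not seen the two styles side by side, here is a minimal sketch of the difference. The function mean2 is a hypothetical illustration, not one of the routines in this book, and the older K&R form is shown only inside a comment, since some recent compilers no longer accept it.

```c
/* ANSI C style: the prototype states the argument types, so the compiler
   can check (and convert) the arguments at every call. */
float mean2(float a, float b);

float mean2(float a, float b)
{
    return 0.5f * (a + b);
}

/* The same function in the older K&R style, kept in a comment for
   comparison. Argument types are declared between the parameter list
   and the body, and callers get no type checking:

   float mean2(a, b)
   float a, b;
   {
       return 0.5 * (a + b);
   }
*/
```

A call such as mean2(1.0f, 3.0f) looks the same under either style; what changes is how much the compiler knows about the arguments.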

Note our convention of handling all errors and exceptional cases with a statement
like nrerror("some error message");. The function nrerror() is part of a
small file of utility programs, nrutil.c, listed in Appendix B at the back of the
book. This Appendix includes a number of other utilities that we will describe later in
this chapter. Function nrerror() prints the indicated error message to your stderr
device (usually your terminal screen), and then invokes the function exit(), which
terminates execution. The function exit() is in every C library we know of; but if
you find it missing, you can modify nrerror() so that it does anything else that will
halt execution. For example, you can have it pause for input from the keyboard, and
then manually interrupt execution. In some applications, you will want to modify
nrerror() to do more sophisticated error handling, for example to transfer control
somewhere else, with an error flag or error code set.
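
As one possible illustration of the "more sophisticated" variant just mentioned, the sketch below shows an nrerror() that normally prints its message and exits, but can instead set an error flag and transfer control back to a recovery point. This is not the Appendix B listing. The names try_call, nr_recover, nr_recover_armed, nr_error_flag, demo_ok and demo_fail are all assumptions introduced for this example, and the use of setjmp/longjmp is our choice of mechanism, not something the text prescribes.

```c
#include <stdio.h>
#include <stdlib.h>
#include <setjmp.h>

static jmp_buf nr_recover;          /* hypothetical recovery point */
static int nr_recover_armed = 0;    /* nonzero while a recovery point is active */
static int nr_error_flag = 0;       /* set to 1 when nrerror() has been called */

/* Print the message to stderr; then either jump back to the recovery
   point with the error flag set, or halt execution with exit(). */
void nrerror(char error_text[])
{
    fprintf(stderr, "run-time error: %s\n", error_text);
    if (nr_recover_armed) {
        nr_error_flag = 1;
        longjmp(nr_recover, 1);
    }
    exit(1);
}

/* Hypothetical wrapper: run a routine, returning 0 if it completed
   normally and 1 if it called nrerror(). */
int try_call(void (*routine)(void))
{
    nr_error_flag = 0;
    nr_recover_armed = 1;
    if (setjmp(nr_recover) == 0) routine();
    nr_recover_armed = 0;
    return nr_error_flag;
}

/* Two stand-in routines used only to exercise the wrapper. */
void demo_ok(void)
{
}

void demo_fail(void)
{
    nrerror("demonstration failure");
}
```

With these definitions, try_call(demo_ok) returns 0 and try_call(demo_fail) returns 1 after printing the message; a call to nrerror() made outside try_call still terminates the program, which matches the default behavior described above.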

We will have more to say about the C programming language, its conventions 
and style, in §1.1 and §1.2. 

Computational Environment and Program Validation 

Our goal is that the programs in this book be as portable as possible, across 
different platforms (models of computer), across different operating systems, and 
across different C compilers. C was designed with this type of portability in 
mind. Nevertheless, we have found that there is no substitute for actually checking 
all programs on a variety of compilers, in the process uncovering differences in 
library structure or contents, and even occasional differences in allowed syntax. As 
surrogates for the large number of possible combinations, we have tested all the 
programs in this book on the combinations of machines, operating systems, and 
compilers shown on the accompanying table. More generally, the programs should 
run without modification on any compiler that implements the ANSI C standard, 
as described for example in Harbison and Steele’s excellent book [1]. With small
modifications, our programs should run on any compiler that implements the older,
de facto K&R standard [2]. An example of the kind of trivial incompatibility to
watch out for is that ANSI C requires the memory allocation functions malloc()
and free() to be declared via the header stdlib.h; some older compilers require
them to be declared with the header file malloc.h, while others regard them as
inherent in the language and require no header file at all.

Tested Machines and Compilers

Hardware                    O/S Version               Compiler Version
IBM PC compatible 486/33    MS-DOS 5.0/Windows 3.1    Microsoft C/C++ 7.0
IBM PC compatible 486/33    MS-DOS 5.0                Borland C/C++ 2.0
IBM RS/6000                 AIX 3.2                   IBM xlc 1.02
DECstation 5000/25          ULTRIX 4.2A               CodeCenter (Saber) C 3.1.1
DECsystem 5400              ULTRIX 4.1                GNU C Compiler 2.1
Sun SPARCstation 2          SunOS 4.1                 GNU C Compiler 1.40
DECstation 5000/200         ULTRIX 4.2                DEC RISC C 2.1*
Sun SPARCstation 2          SunOS 4.1                 Sun cc 1.1*

*compiler version does not fully implement ANSI C; only K&R validated
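
The malloc()/free() header issue mentioned above can be made concrete with a short sketch. It assumes an ANSI C compiler, so the declarations come from stdlib.h; on an older compiler of the kind described, that one #include line is what you would expect to change. The helper names alloc_floats and sum_first_n are hypothetical and are not taken from nrutil.c.

```c
#include <stdio.h>
#include <stdlib.h>   /* ANSI C: declares malloc(), free(), and exit() */

/* Hypothetical helper: allocate n floats (n > 0), halting on failure. */
float *alloc_floats(size_t n)
{
    float *v = (float *) malloc(n * sizeof(float));

    if (v == NULL) {
        fprintf(stderr, "allocation failure in alloc_floats\n");
        exit(1);
    }
    return v;
}

/* Hypothetical demonstration: sum 1 + 2 + ... + n via a temporary array. */
float sum_first_n(int n)
{
    float *v, sum = 0.0f;
    int j;

    if (n <= 0) return 0.0f;
    v = alloc_floats((size_t) n);
    for (j = 0; j < n; j++) v[j] = (float) (j + 1);   /* fill */
    for (j = 0; j < n; j++) sum += v[j];              /* accumulate */
    free(v);
    return sum;
}
```

Keeping the allocation behind a single helper means that a header difference between compilers has to be dealt with in only one place.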

In validating the programs, we have taken the program source code directly 
from the machine-readable form of the book’s manuscript, to decrease the chance 
of propagating typographical errors. “Driver” or demonstration programs that we 
used as part of our validations are available separately as the Numerical Recipes 
Example Book (C), as well as in machine-readable form. If you plan to use more 
than a few of the programs in this book, or if you plan to use programs in this book 
on more than one different computer, then you may find it useful to obtain a copy 
of these demonstration programs. 

Of course we would be foolish to claim that there are no bugs in our programs, 
and we do not make such a claim. We have been very careful, and have benefitted 
from the experience of the many readers who have written to us. If you find a new 
bug, please document it and tell us! 

Compatibility with the First Edition 

If you are accustomed to the Numerical Recipes routines of the First Edition, rest 
assured: almost all of them are still here, with the same names and functionalities, 
often with major improvements in the code itself. In addition, we hope that you 
will soon become equally familiar with the added capabilities of the more than 100 
routines that are new to this edition. 

We have retired a small number of First Edition routines, those that we believe 
to be clearly dominated by better methods implemented in this edition. A table, 
following, lists the retired routines and suggests replacements. 

First Edition users should also be aware that some routines common to both 
editions have alterations in their calling interfaces, so are not directly “plug
compatible.” A fairly complete list is: chsone, chstwo, covsrt, dfpmin, laguer, lfit,
memcof, mrqcof, mrqmin, pzextr, ran4, realft, rzextr, shoot, shootf. There
may be others (depending in part on which printing of the First Edition is taken
for the comparison). If you have written software of any appreciable complexity
that is dependent on First Edition routines, we do not recommend blindly replacing
them by the corresponding routines in this book. We do recommend that any new
programming efforts use the new routines.

Previous Routines Omitted from This Edition

Name(s)           Replacement(s)                              Comment
adi               mglin or mgfas                              better method
cosft             cosft1 or cosft2                            choice of boundary conditions
cel, el2          rf, rd, rj, rc                              better algorithms
des, desks        ran4 now uses psdes                         was too slow
mdian1, mdian2    select, selip                               more general
qcksrt            sort                                        name change (sort is now hpsort)
rkqc              rkqs                                        better method
smooft            use convlv with coefficients from savgol
sparse            linbcg                                      more general

About References 

You will find references, and suggestions for further reading, listed at the 
end of most sections of this book. References are cited in the text by bracketed 
numbers like this [3].

Because computer algorithms often circulate informally for quite some time 
before appearing in a published form, the task of uncovering “primary literature” 
is sometimes quite difficult. We have not attempted this, and we do not pretend 
to any degree of bibliographical completeness in this book. For topics where a 
substantial secondary literature exists (discussion in textbooks, reviews, etc.) we 
have consciously limited our references to a few of the more useful secondary 
sources, especially those with good references to the primary literature. Where the 
existing secondary literature is insufficient, we give references to a few primary 
sources that are intended to serve as starting points for further reading, not as 
complete bibliographies for the field. 

The order in which references are listed is not necessarily significant. It reflects a 
compromise between listing cited references in the order cited, and listing suggestions 
for further reading in a roughly prioritized order, with the most useful ones first. 

The remaining three sections of this chapter review some basic concepts of 
programming (control structures, etc.), discuss a set of conventions specific to C 
that we have adopted in this book, and introduce some fundamental concepts in 
numerical analysis (roundoff error, etc.). Thereafter, we plunge into the substantive 
material of the book. 

CITED REFERENCES AND FURTHER READING: 

Harbison, S.P., and Steele, G.L., Jr. 1991, C: A Reference Manual, 3rd ed. (Englewood Cliffs,
NJ: Prentice-Hall). [1]

Kernighan, B., and Ritchie, D. 1978, The C Programming Language (Englewood Cliffs, NJ:
Prentice-Hall). [2] [Reference for K&R “traditional” C. Later editions of this book conform
to the ANSI C standard.]

Meeus, J. 1982, Astronomical Formulae for Calculators, 2nd ed., revised and enlarged
(Richmond, VA: Willmann-Bell). [3]


1.1 Program Organization and Control Structures

We sometimes like to point out the close analogies between computer programs, 
on the one hand, and written poetry or written musical scores, on the other. All 
three present themselves as visual media, symbols on a two-dimensional page or 
computer screen. Yet, in all three cases, the visual, two-dimensional, frozen-in-time 
representation communicates (or is supposed to communicate) something rather 
different, namely a process that unfolds in time. A poem is meant to be read; music, 
played; a program, executed as a sequential series of computer instructions. 

In all three cases, the target of the communication, in its visual form, is a human 
being. The goal is to transfer to him/her, as efficiently as can be accomplished, 
the greatest degree of understanding, in advance, of how the process will unfold in 
time. In poetry, this human target is the reader. In music, it is the performer. In 
programming, it is the program user. 

Now, you may object that the target of communication of a program is not 
a human but a computer, that the program user is only an irrelevant intermediary, 
a lackey who feeds the machine. This is perhaps the case in the situation where 
the business executive pops a diskette into a desktop computer and feeds that 
computer a black-box program in binary executable form. The computer, in this 
case, doesn’t much care whether that program was written with “good programming 
practice” or not. 

We envision, however, that you, the readers of this book, are in quite a different 
situation. You need, or want, to know not just what a program does, but also how 
it does it, so that you can tinker with it and modify it to your particular application. 
You need others to be able to see what you have done, so that they can criticize or 
admire. In such cases, where the desired goal is maintainable or reusable code, the 
targets of a program’s communication are surely human, not machine. 

One key to achieving good programming practice is to recognize that programming,
music, and poetry (all three being symbolic constructs of the human
brain) are naturally structured into hierarchies that have many different nested
levels. Sounds (phonemes) form small meaningful units (morphemes) which in turn
form words; words group into phrases, which group into sentences; sentences make 
paragraphs, and these are organized into higher levels of meaning. Notes form 
musical phrases, which form themes, counterpoints, harmonies, etc.; which form 
movements, which form concertos, symphonies, and so on. 

The structure in programs is equally hierarchical. Appropriately, good programming
practice brings different techniques to bear on the different levels [1-3]. At a low
level is the ASCII character set. Then, constants, identifiers, operands, operators.
Then program statements, like a[j+1]=b+c/3.0;. Here, the best programming
advice is simply be clear, or (correspondingly) don’t be too tricky. You might
momentarily be proud of yourself at writing the single line


k=(2-j)*(1+3*j)/2;


if you want to permute cyclically one of the values j = (0,1,2) into respectively 
k = (1,2,0). You will regret it later, however, when you try to understand that 
line. Better, and likely also faster, is 


k=j+1;
if (k == 3) k=0;


Many programming stylists would even argue for the ploddingly literal 

switch (j) {
    case 0: k=1; break;
    case 1: k=2; break;
    case 2: k=0; break;
    default: {
        fprintf(stderr,"unexpected value for j");
        exit(1);
    }
}


on the grounds that it is both clear and additionally safeguarded from wrong
assumptions about the possible values of j. Our preference among the implementations
is for the middle one. 
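
To check that the three fragments really do compute the same cyclic permutation, they can be wrapped as functions and compared. This is only a sketch: the function names next_tricky, next_clear, next_switch and variants_agree are ours, introduced for this comparison.

```c
#include <stdio.h>
#include <stdlib.h>

/* The "too tricky" one-liner. */
int next_tricky(int j)
{
    return (2 - j) * (1 + 3 * j) / 2;
}

/* The middle (preferred) version. */
int next_clear(int j)
{
    int k = j + 1;

    if (k == 3) k = 0;
    return k;
}

/* The ploddingly literal version, guarded against unexpected j. */
int next_switch(int j)
{
    int k = 0;

    switch (j) {
        case 0: k = 1; break;
        case 1: k = 2; break;
        case 2: k = 0; break;
        default:
            fprintf(stderr, "unexpected value for j\n");
            exit(1);
    }
    return k;
}

/* Returns 1 if all three versions agree for j = 0, 1, 2; otherwise 0. */
int variants_agree(void)
{
    int j;

    for (j = 0; j < 3; j++) {
        if (next_tricky(j) != next_clear(j)) return 0;
        if (next_switch(j) != next_clear(j)) return 0;
    }
    return 1;
}
```

The versions differ outside the intended range: next_switch halts on any other j, while the other two quietly return some value, which is the safeguarding argument made for the literal form.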

In this simple example, we have in fact traversed several levels of hierarchy: 
Statements frequently come in “groups” or “blocks” which make sense only taken 
as a whole. The middle fragment above is one example. Another is 


swap=a[j]; 
a[j]=b[j];
b[j]=swap; 


which makes immediate sense to any programmer as the exchange of two variables, 
while 


ans=sum=0.0; 
n=1;

is very likely to be an initialization of variables prior to some iterative process. This 
level of hierarchy in a program is usually evident to the eye. It is good programming 
practice to put in comments at this level, e.g., “initialize” or “exchange variables.” 
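
A short sketch of what block-level comments of this kind might look like in a complete function follows. The routine exchange_and_sum and its test helper exchange_demo are hypothetical examples, not routines from this book.

```c
/* Exchange the contents of a[0..n-1] and b[0..n-1], and return the sum
   of the elements that end up in a. */
float exchange_and_sum(float a[], float b[], int n)
{
    float swap, sum;
    int j;

    /* initialize */
    sum = 0.0f;

    for (j = 0; j < n; j++) {
        /* exchange variables */
        swap = a[j];
        a[j] = b[j];
        b[j] = swap;

        /* accumulate */
        sum += a[j];
    }
    return sum;
}

/* Returns 1 if a small fixed case behaves as expected, otherwise 0. */
int exchange_demo(void)
{
    float a[3] = {1.0f, 2.0f, 3.0f};
    float b[3] = {10.0f, 20.0f, 30.0f};
    float sum = exchange_and_sum(a, b, 3);

    return sum == 60.0f && a[0] == 10.0f && b[2] == 3.0f;
}
```

Each comment names the purpose of a group of statements rather than restating a single line, which is the level of the hierarchy being discussed here.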

The next level is that of control structures. These are things like the switch 
construction in the example above, for loops, and so on. This level is sufficiently 
important, and relevant to the hierarchical level of the routines in this book, that 
we will come back to it just below. 

At still higher levels in the hierarchy, we have functions and modules, and the 
whole “global” organization of the computational task to be done. In the musical 
analogy, we are now at the level of movements and complete works. At these levels,
modularization and encapsulation become important programming concepts, the
general idea being that program units should interact with one another only through
clearly defined and narrowly circumscribed interfaces. Good modularization practice
is an essential prerequisite to the success of large, complicated software projects,
especially those employing the efforts of more than one programmer. It is also good
practice (if not quite as essential) in the less massive programming tasks that an
individual scientist, or reader of this book, encounters.

Sample page from NUMERICAL RECIPES IN C: THE ART OF SCIENTIFIC COMPUTING (ISBN 0-521-43108-5)

Some computer languages, such as Modula-2 and C++, promote good modularization
with higher-level language constructs absent in C. In Modula-2, for example,
functions, type definitions, and data structures can be encapsulated into “modules”
that communicate through declared public interfaces and whose internal workings
are hidden from the rest of the program [4]. In the C++ language, the key concept
is “class,” a user-definable generalization of data type that provides for data hiding,
automatic initialization of data, memory management, dynamic typing, and operator
overloading (i.e., the user-definable extension of operators like + and * so as to be
appropriate to operands in any particular class) [5]. Properly used in defining the data
structures that are passed between program units, classes can clarify and circumscribe 
these units’ public interfaces, reducing the chances of programming error and also 
allowing a considerable degree of compile-time and run-time error checking. 

Beyond modularization, though depending on it, lie the concepts of object-oriented
programming. Here a programming language, such as C++ or Turbo Pascal
5.5 [6], allows a module’s public interface to accept redefinitions of types or actions,
and these redefinitions become shared all the way down through the module’s
hierarchy (so-called polymorphism). For example, a routine written to invert a matrix
of real numbers could — dynamically, at run time — be made able to handle complex 
numbers by overloading complex data types and corresponding definitions of the 
arithmetic operations. Additional concepts of inheritance (the ability to define a data 
type that “inherits” all the structure of another type, plus additional structure of its 
own), and object extensibility (the ability to add functionality to a module without 
access to its source code, e.g., at run time), also come into play. 

We have not attempted to modularize, or make objects out of, the routines in this 
book, for at least two reasons. First, the chosen language, C, does not really make 
this possible. Second, we envision that you, the reader, might want to incorporate 
the algorithms in this book, a few at a time, into modules or objects with a structure 
of your own choosing. There does not exist, at present, a standard or accepted set 
of “classes” for scientific object-oriented computing. While we might have tried to 
invent such a set, doing so would have inevitably tied the algorithmic content of the 
book (which is its raison d’etre) to some rather specific, and perhaps haphazard, set 
of choices regarding class definitions. 

On the other hand, we are not unfriendly to the goals of modular and object-oriented
programming. Within the limits of C, we have therefore tried to structure
our programs to be “object friendly.” That is one reason we have adopted ANSI
C with its function prototyping as our default C dialect (see §1.2). Also, within
our implementation sections, we have paid particular attention to the practices of
structured programming, as we now discuss.

Control Structures

An executing program unfolds in time, but not strictly in the linear order in 
which the statements are written. Program statements that affect the order in which 
statements are executed, or that affect whether statements are executed, are called 
control statements. Control statements never make useful sense by themselves. They 
make sense only in the context of the groups or blocks of statements that they in turn 
control. If you think of those blocks as paragraphs containing sentences, then the 
control statements are perhaps best thought of as the indentation of the paragraph 
and the punctuation between the sentences, not the words within the sentences. 

We can now say what the goal of structured programming is. It is to make 
program control manifestly apparent in the visual presentation of the program. You 
see that this goal has nothing at all to do with how the computer sees the program. 
As already remarked, computers don’t care whether you use structured programming 
or not. Human readers, however, do care. You yourself will also care, once you 
discover how much easier it is to perfect and debug a well-structured program than 
one whose control structure is obscure. 

You accomplish the goals of structured programming in two complementary 
ways. First, you acquaint yourself with the small number of essential control 
structures that occur over and over again in programming, and that are therefore 
given convenient representations in most programming languages. You should learn 
to think about your programming tasks, insofar as possible, exclusively in terms of 
these standard control structures. In writing programs, you should get into the habit 
of representing these standard control structures in consistent, conventional ways. 

“Doesn’t this inhibit creativity?” our students sometimes ask. Yes, just
as Mozart’s creativity was inhibited by the sonata form, or Shakespeare’s by the 
metrical requirements of the sonnet. The point is that creativity, when it is meant to 
communicate, does well under the inhibitions of appropriate restrictions on format. 

Second, you avoid, insofar as possible, control statements whose controlled 
blocks or objects are difficult to discern at a glance. This means, in practice, that you 
must try to avoid named labels on statements and goto’s. It is not the goto’s that 
are dangerous (although they do interrupt one’s reading of a program); the named 
statement labels are the hazard. In fact, whenever you encounter a named statement 
label while reading a program, you will soon become conditioned to get a sinking 
feeling in the pit of your stomach. Why? Because the following questions will, by 
habit, immediately spring to mind: Where did control come from in a branch to this
label? It could be anywhere in the routine! What circumstances resulted in a branch 
to this label? They could be anything! Certainty becomes uncertainty, understanding 
dissolves into a morass of possibilities. 
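
The contrast can be sketched with one small task written twice. The task (find the index of the first negative element of an array) and the function names below are our own illustration; the point is only that the second version shows its control flow in its layout, while the first makes the reader hunt for every branch to top and done.

```c
/* Label-and-goto version: to understand either label, the reader must
   search the whole routine for the branches that lead to it. */
int first_negative_goto(const float a[], int n)
{
    int j = 0;

top:
    if (j >= n) goto done;
    if (a[j] < 0.0f) goto done;
    j++;
    goto top;
done:
    return (j < n) ? j : -1;
}

/* Structured version: the loop and its single exit are visible at a glance. */
int first_negative_structured(const float a[], int n)
{
    int j;

    for (j = 0; j < n; j++) {
        if (a[j] < 0.0f) break;
    }
    return (j < n) ? j : -1;
}

/* Returns 1 if both versions give the expected answers on two small
   fixed arrays, otherwise 0. */
int goto_demo_agrees(void)
{
    float a[4] = {1.0f, 2.0f, -3.0f, 4.0f};
    float b[2] = {1.0f, 2.0f};

    return first_negative_goto(a, 4) == 2
        && first_negative_structured(a, 4) == 2
        && first_negative_goto(b, 2) == -1
        && first_negative_structured(b, 2) == -1;
}
```

Both functions return the index of the first negative element, or -1 if there is none.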

Some examples are now in order to make these considerations more concrete 
(see Figure 1.1.1). 

Catalog of Standard Structures 

Iteration. In C, simple iteration is performed with a for loop, for example 

for (j=2;j<=1000;j++) {
    b[j]=a[j-1];
}

Figure 1.1.1. Standard control structures used in structured programming: (a) for iteration; (b) while
iteration; (c) do while iteration; (d) break iteration; (e) if structure; (f) switch structure.

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Notice how we always indent the block of code that is acted upon by the control 
structure, leaving the structure itself unindented. Notice also our habit of putting the 
initial curly brace on the same line as the for statement, instead of on the next line. 
This saves a full line of white space, and our publisher loves us for it. 


IF structure. This structure in C is similar to that found in Pascal, Algol, 
FORTRAN and other languages, and typically looks like 

if (...) {
    ...
}
else if (...) {
    ...
}
else {
    ...
}


Since compound-statement curly braces are required only when there is more 
than one statement in a block, however, C’s if construction can be somewhat less 
explicit than the corresponding structure in FORTRAN or Pascal. Some care must be 
exercised in constructing nested if clauses. For example, consider the following: 

if (b > 3)
    if (a > 3) b += 1;
else b -= 1;            /* questionable! */

As judged by the indentation used on successive lines, the intent of the writer of 
this code is the following: ‘If b is greater than 3 and a is greater than 3, then 
increment b. If b is not greater than 3, then decrement b.’ According to the rules 
of C, however, the actual meaning is ‘If b is greater than 3, then evaluate a. If a is 
greater than 3, then increment b, and if a is less than or equal to 3, decrement b.’ The 
point is that an else clause is associated with the most recent open if statement, 
no matter how you lay it out on the page. Such confusions in meaning are easily 
resolved by the inclusion of braces. They may in some instances be technically 
superfluous; nevertheless, they clarify your intent and improve the program. The 
above fragment should be written as 

if (b > 3) {
    if (a > 3) b += 1;
} else {
    b -= 1;
}
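
To make the difference concrete, here is a minimal sketch that wraps both fragments in functions so that their results can be compared. The function names unbraced and braced are ours, introduced only for illustration; many compilers will warn about the first function, which is exactly the point.

```c
/* Unbraced fragment: the else binds to the nearest open if, (a > 3),
   whatever the layout on the page suggests. */
int unbraced(int a, int b)
{
    if (b > 3)
        if (a > 3) b += 1;
    else b -= 1;
    return b;
}

/* Braced fragment: the else now belongs to the outer if, (b > 3). */
int braced(int a, int b)
{
    if (b > 3) {
        if (a > 3) b += 1;
    } else {
        b -= 1;
    }
    return b;
}
```

With a equal to 1 and b equal to 5, unbraced returns 4 while braced returns 5; with a equal to 1 and b equal to 2, unbraced returns 2 while braced returns 1.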

Here is a working program that consists dominantly of if control statements: 

#include <math.h>
#define IGREG (15+31L*(10+12L*1582))    /* Gregorian Calendar adopted Oct. 15, 1582. */

long julday(int mm, int id, int iyyy)
/* In this routine julday returns the Julian Day Number that begins at noon of the calendar date
specified by month mm, day id, and year iyyy, all integer variables. Positive year signifies A.D.;
negative, B.C. Remember that the year after 1 B.C. was 1 A.D. */
{
    void nrerror(char error_text[]);
    long jul;
    int ja,jy=iyyy,jm;

    if (jy == 0) nrerror("julday: there is no year zero.");
    if (jy < 0) ++jy;
    if (mm > 2) {                       /* Here is an example of a block IF-structure. */
        jm=mm+1;
    } else {
        --jy;
        jm=mm+13;
    }
    jul = (long) (floor(365.25*jy)+floor(30.6001*jm)+id+1720995);
    if (id+31L*(mm+12L*iyyy) >= IGREG) {    /* Test whether to change to Gregorian Calendar. */
        ja=(int)(0.01*jy);
        jul += 2-ja+(int) (0.25*ja);
    }
    return jul;
}

(Astronomers number each 24-hour period, starting and ending at noon, with 
a unique integer, the Julian Day Number [7]. Julian Day Zero was a very long 
time ago; a convenient reference point is that Julian Day 2440000 began at noon 
of May 23, 1968. If you know the Julian Day Number that begins at noon of a 
given calendar date, then the day of the week of that date is obtained by adding 
1 and taking the result modulo base 7; a zero answer corresponds to Sunday, 1 to 
Monday, ..., 6 to Saturday.) 
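
The day-of-week rule just stated can be sketched in a few lines. The function names weekday_from_jdn and weekday_name are hypothetical helpers, not routines from this book, and the sketch assumes a nonnegative Julian Day Number, since C's % operator can return a negative remainder otherwise.

```c
/* Day of the week from a Julian Day Number: add 1 and reduce modulo 7,
   so that 0 is Sunday, 1 is Monday, ..., 6 is Saturday.
   Assumes jdn >= 0. */
int weekday_from_jdn(long jdn)
{
    return (int) ((jdn + 1) % 7);
}

/* Name lookup for a value idwk in the range 0..6 returned by weekday_from_jdn(). */
const char *weekday_name(int idwk)
{
    static const char *names[7] = {
        "Sunday", "Monday", "Tuesday", "Wednesday",
        "Thursday", "Friday", "Saturday"
    };
    return names[idwk];
}
```

For the reference point quoted above, Julian Day 2440000 (May 23, 1968), the first function returns 4, a Thursday.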

While iteration. Most languages (though not FORTRAN, incidentally) provide 
for structures like the following C example: 

while (n < 1000) {
    n *= 2;
    j += 1;
}

It is the particular feature of this structure that the control-clause (in this case 
n < 1000) is evaluated before each iteration. If the clause is not true, the enclosed 
statements will not be executed. In particular, if this code is encountered at a time 
when n is greater than or equal to 1000, the statements will not even be executed once. 

Do-While iteration. Companion to the while iteration is a related control- 
structure that tests its control-clause at the end of each iteration. In C, it looks 
like this: 

do {
    n *= 2;
    j += 1;
} while (n < 1000);

In this case, the enclosed statements will be executed at least once, independent 
of the initial value of n. 
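
The contrast between the two loops is easiest to see by counting passes through the body. The following sketch wraps each loop in a function; the names passes_while and passes_do_while are ours, and the sketch assumes n is at least 1 on entry (with n equal to 0 neither loop would terminate).

```c
/* Number of passes through the body under a while loop.
   The test comes first, so the body may run zero times. Assumes n >= 1. */
int passes_while(long n)
{
    int j = 0;
    while (n < 1000) {
        n *= 2;
        j += 1;
    }
    return j;
}

/* The same body under do-while. The test comes last,
   so the body always runs at least once. Assumes n >= 1. */
int passes_do_while(long n)
{
    int j = 0;
    do {
        n *= 2;
        j += 1;
    } while (n < 1000);
    return j;
}
```

Starting from n equal to 1000, the first function reports zero passes and the second reports one; starting from n equal to 1, both report ten.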

Break. In this case, you have a loop that is repeated indefinitely until some 
condition tested somewhere in the middle of the loop (and possibly tested in more 
than one place) becomes true. At that point you wish to exit the loop and proceed 
with what comes after it. In C the structure is implemented with the simple break 
statement, which terminates execution of the innermost for, while, do, or switch 
construction and proceeds to the next sequential instruction. (In Pascal and standard 
FORTRAN, this structure requires the use of statement labels, to the detriment of clear 
programming.) A typical usage of the break statement is: 

for(;;) {
    [statements before the test]
    if (...) break;
    [statements after the test]
}
[next sequential instruction]
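
As one concrete instance of this pattern, the sketch below accumulates array elements and leaves the loop from either of two tests. The function name count_until_exceeds and its arguments are hypothetical, chosen only to illustrate the structure.

```c
/* Adds successive elements of x[0..n-1] and stops as soon as the running
   total exceeds limit. Returns the number of elements consumed. */
int count_until_exceeds(const float x[], int n, float limit)
{
    int j = 0;
    float sum = 0.0f;
    for (;;) {
        if (j >= n) break;          /* first exit: ran out of data */
        sum += x[j];
        j += 1;
        if (sum > limit) break;     /* second exit: limit passed */
    }
    return j;
}
```

Both exits land on the same next sequential instruction, the return statement, so a reader never has to ask where control came from.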


Here is a program that uses several different iteration structures. One of us was 
once asked, for a scavenger hunt, to find the date of a Friday the 13th on which the 
moon was full. This is a program which accomplishes that task, giving incidentally 
all other Fridays the 13th as a by-product. 


#include <stdio.h>
#include <math.h>
#define ZON -5.0            /* Time zone -5 is Eastern Standard Time. */
#define IYBEG 1900          /* The range of dates to be searched. */
#define IYEND 2000

int main(void)  /* Program badluk */
{
    void flmoon(int n, int nph, long *jd, float *frac);
    long julday(int mm, int id, int iyyy);
    int ic,icon,idwk,im,iyyy,n;
    float timzon = ZON/24.0,frac;
    long jd,jday;

    printf("\nFull moons on Friday the 13th from %5d to %5d\n",IYBEG,IYEND);
    for (iyyy=IYBEG;iyyy<=IYEND;iyyy++) {       /* Loop over each year, */
        for (im=1;im<=12;im++) {                /* and each month. */
            jday=julday(im,13,iyyy);            /* Is the 13th a Friday? */
            idwk=(int) ((jday+1) % 7);
            if (idwk == 5) {
                n=(int)(12.37*(iyyy-1900+(im-0.5)/12.0));
                /* This value n is a first approximation to how many full moons have occurred
                since 1900. We will feed it into the phase routine and adjust it up or down
                until we determine that our desired 13th was or was not a full moon. The
                variable icon signals the direction of adjustment. */
                icon=0;
                for (;;) {
                    flmoon(n,2,&jd,&frac);      /* Get date of full moon n. */
                    frac=24.0*(frac+timzon);    /* Convert to hours in correct time zone. */
                    if (frac < 0.0) {           /* Convert from Julian Days beginning at noon */
                        --jd;                   /* to civil days beginning at midnight. */
                        frac += 24.0;
                    }
                    if (frac > 12.0) {
                        ++jd;
                        frac -= 12.0;
                    } else
                        frac += 12.0;
                    if (jd == jday) {           /* Did we hit our target day? */
                        printf("\n%2d/13/%4d\n",im,iyyy);
                        printf("%s %5.1f %s\n","Full moon",frac,
                            " hrs after midnight (EST)");
                        break;                  /* Part of the break-structure, a match. */
                    } else {                    /* Didn't hit it. */
                        ic=(jday >= jd ? 1 : -1);
                        if (ic == (-icon)) break;   /* Another break, case of no match. */
                        icon=ic;
                        n += ic;
                    }
                }
            }
        }
    }
    return 0;
}

If you are merely curious, there were (or will be) occurrences of a full moon 
on Friday the 13th (time zone GMT-5) on: 3/13/1903, 10/13/1905, 6/13/1919, 
1/13/1922, 11/13/1970, 2/13/1987, 10/13/2000, 9/13/2019, and 8/13/2049. 

Other “standard” structures. Our advice is to avoid them. Every programming 
language has some number of “goodies” that the designer just couldn’t 
resist throwing in. They seemed like a good idea at the time. Unfortunately they 
don’t stand the test of time! Your program becomes difficult to translate into other 
languages, and difficult to read (because rarely used structures are unfamiliar to the 
reader). You can almost always accomplish the supposed conveniences of these 
structures in other ways. 

In C, the most problematic control structure is the switch...case...default 
construction (see Figure 1.1.1), which has historically been burdened by uncertainty, 
from compiler to compiler, about what data types are allowed in its control expression. 
Data types char and int are universally supported. For other data types, e.g., float 
or double, the structure should be replaced by a more recognizable and translatable 
if...else construction. ANSI C allows the control expression to be of type long, 
but many older compilers do not. 

The continue; construction, while benign, can generally be replaced by an 
if construction with no loss of clarity. 
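
A small sketch of that replacement follows. Both functions compute the sum of the positive elements of an integer array; the names sum_positive_continue and sum_positive_if are hypothetical, introduced only for the comparison.

```c
/* Sum of the positive elements of x[0..n-1], written with continue. */
int sum_positive_continue(const int x[], int n)
{
    int j, sum = 0;
    for (j = 0; j < n; j++) {
        if (x[j] <= 0) continue;    /* skip the rest of this pass */
        sum += x[j];
    }
    return sum;
}

/* The same computation with the test inverted into an if construction. */
int sum_positive_if(const int x[], int n)
{
    int j, sum = 0;
    for (j = 0; j < n; j++) {
        if (x[j] > 0) sum += x[j];
    }
    return sum;
}
```

The second form states directly which elements are summed, rather than which are skipped.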

About “Advanced Topics” 

Material set in smaller type, like this, signals an “advanced topic,” either one outside of 
the main argument of the chapter, or else one requiring of you more than the usual assumed 
mathematical background, or else (in a few cases) a discussion that is more speculative or an 
algorithm that is less well-tested. Nothing important will be lost if you skip the advanced 
topics on a first reading of the book. 

You may have noticed that, by its looping over the months and years, the program badluk 
avoids using any algorithm for converting a Julian Day Number back into a calendar date. A 
routine for doing just this is not very interesting structurally, but it is occasionally useful: 

#include <math.h>
#define IGREG 2299161

void caldat(long julian, int *mm, int *id, int *iyyy)
/* Inverse of the function julday given above. Here julian is input as a Julian Day Number,
and the routine outputs mm, id, and iyyy as the month, day, and year on which the specified
Julian Day started at noon. */
{
    long ja,jalpha,jb,jc,jd,je;

    if (julian >= IGREG) {      /* Cross-over to Gregorian Calendar produces this correction. */
        jalpha=(long)(((double) (julian-1867216)-0.25)/36524.25);
        ja=julian+1+jalpha-(long) (0.25*jalpha);
    } else if (julian < 0) {    /* Make day number positive by adding integer number of */
        ja=julian+36525*(1-julian/36525);   /* Julian centuries, then subtract them off */
    } else                                  /* at the end. */
        ja=julian;
    jb=ja+1524;
    jc=(long)(6680.0+((double) (jb-2439870)-122.1)/365.25);
    jd=(long)(365*jc+(0.25*jc));
    je=(long)((jb-jd)/30.6001);
    *id=jb-jd-(long) (30.6001*je);
    *mm=je-1;
    if (*mm > 12) *mm -= 12;
    *iyyy=jc-4715;
    if (*mm > 2) --(*iyyy);
    if (*iyyy <= 0) --(*iyyy);
    if (julian < 0) *iyyy -= 100*(1-julian/36525);
}


(For additional calendrical algorithms, applicable to various historical calendars, see [8].) 

CITED REFERENCES AND FURTHER READING: 

Harbison, S.P., and Steele, G.L., Jr. 1991, C: A Reference Manual, 3rd ed. (Englewood Cliffs, 
NJ: Prentice-Hall). 

Kernighan, B.W. 1978, The Elements of Programming Style (New York: McGraw-Hill). [1] 

Yourdon, E. 1975, Techniques of Program Structure and Design (Englewood Cliffs, NJ: 
Prentice-Hall). [2] 

Jones, R., and Stewart, I. 1987, The Art of C Programming (New York: Springer-Verlag). [3] 

Hoare, C.A.R. 1981, Communications of the ACM, vol. 24, pp. 75-83. 

Wirth, N. 1983, Programming in Modula-2, 3rd ed. (New York: Springer-Verlag). [4] 

Stroustrup, B. 1986, The C++ Programming Language (Reading, MA: Addison-Wesley). [5] 

Borland International, Inc. 1989, Turbo Pascal 5.5 Object-Oriented Programming Guide (Scotts 
Valley, CA: Borland International). [6] 

Meeus, J. 1982, Astronomical Formulae for Calculators, 2nd ed., revised and enlarged 
(Richmond, VA: Willmann-Bell). [7] 

Hatcher, D.A. 1984, Quarterly Journal of the Royal Astronomical Society, vol. 25, pp. 53-55; see 
also op. cit. 1985, vol. 26, pp. 151-155, and 1986, vol. 27, pp. 506-507. [8] 


1.2 Some C Conventions for Scientific 
Computing 

The C language was devised originally for systems programming work, not for 
scientific computing. Relative to other high-level programming languages, C puts 
the programmer “very close to the machine” in several respects. It is operator-rich, 
giving direct access to most capabilities of a machine-language instruction set. It 
has a large variety of intrinsic data types (short and long, signed and unsigned 
integers; floating and double-precision reals; pointer types; etc.), and a concise 
syntax for effecting conversions and indirections. It defines an arithmetic on pointers 
(addresses) that relates gracefully to array addressing and is highly compatible with 
the index register structure of many computers. 

Portability has always been another strong point of the C language. C is the 
underlying language of the UNIX operating system; both the language and the 
operating system have by now been implemented on literally hundreds of different 
computers. The language’s universality, portability, and flexibility have attracted 
increasing numbers of scientists and engineers to it. It is commonly used for the 
real-time control of experimental hardware, often in spite of the fact that the standard 
UNIX kernel is less than ideal as an operating system for this purpose. 

The use of C for higher level scientific calculations such as data analysis, 
modeling, and floating-point numerical work has generally been slower in developing. 
In part this is due to the entrenched position of FORTRAN as the mother-tongue of 
virtually all scientists and engineers born before 1960, and most born after. In 
part, also, the slowness of C’s penetration into scientific computing has been due to 
deficiencies in the language that computer scientists have been (we think, stubbornly) 
slow to recognize. Examples are the lack of a good way to raise numbers to small 
integer powers, and the “implicit conversion of float to double” issue, discussed 
below. Many, though not all, of these deficiencies are overcome in the ANSI C 
Standard. Some remaining deficiencies will undoubtedly disappear over time. 

Yet another inhibition to the mass conversion of scientists to the C cult has been, 
up to the time of writing, the decided lack of high-quality scientific or numerical 
libraries. That is the lacuna into which we thrust this edition of Numerical Recipes. 
We certainly do not claim to be a complete solution to the problem. We do hope 
to inspire further efforts, and to lay out by example a set of sensible, practical 
conventions for scientific C programming. 

The need for programming conventions in C is very great. Far from the problem 
of overcoming constraints imposed by the language (our repeated experience with 
Pascal), the problem in C is to choose the best and most natural techniques from 
multiple opportunities — and then to use those techniques completely consistently 
from program to program. In the rest of this section, we set out some of the issues, 
and describe the adopted conventions that are used in all of the routines in this book. 

Function Prototypes and Header Files 

ANSI C allows functions to be defined with function prototypes, which specify 
the type of each function parameter. If a function declaration or definition with 
a prototype is visible, the compiler can check that a given function call invokes 
the function with the correct argument types. All the routines printed in this book 
are in ANSI C prototype form. For the benefit of readers with older “traditional 
K&R” C compilers, the Numerical Recipes C Diskette includes two complete sets of 
programs, one in ANSI, the other in K&R. 

The easiest way to understand prototypes is by example. A function definition 
that would be written in traditional C as 

int g(x,y,z) 
int x, y; 
float z; 



becomes in ANSI C 

int g(int x, int y, float z) 

A function that has no parameters has the parameter type list void. 

A function declaration (as contrasted to a function definition) is used to 
“introduce” a function to a routine that is going to call it. The calling routine needs 
to know the number and type of arguments and the type of the returned value. In 
a function declaration, you are allowed to omit the parameter names. Thus the 
declaration for the above function is allowed to be written 

int g(int, int, float); 
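
The following sketch shows what a visible prototype buys you. The body given to g and the caller call_g are hypothetical, invented only so that the fragment is complete; the declaration itself is the one shown above.

```c
int g(int, int, float);     /* declaration: parameter names omitted */

/* A caller that sees the prototype above. The integer constant 3 is
   converted to float before the call, because the prototype says that
   the third argument has type float. */
int call_g(void)
{
    return g(1, 2, 3);
}

/* Definition in ANSI prototype form; the body is hypothetical. */
int g(int x, int y, float z)
{
    return x + y + (int) (2.0f * z);
}
```

Had the declaration supplied no parameter types, as in traditional C, the third argument would have been passed as an int and silently misinterpreted by g.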

If a C program consists of multiple source files, the compiler cannot check the 
consistency of each function call without some additional assistance. The safest 
way to proceed is as follows: 

• Every external function should have a single prototype declaration in a 
header (.h) file. 

• The source file with the definition (body) of the function should also 
include the header file so that the compiler can check that the prototypes 
in the declaration and the definition match. 

• Every source file that calls the function should include the appropriate 
header (.h) file. 

• Optionally, a routine that calls a function can also include that function’s 
prototype declaration internally. This is often useful when you are 
developing a program, since it gives you a visible reminder (checked by 
the compiler through the common .h file) of a function’s argument types. 
Later, after your program is debugged, you can go back and delete the 
supernumerary internal declarations. 

For the routines in this book, the header file containing all the prototypes is nr.h, 
listed in Appendix A. You should put the statement #include "nr.h" at the top of 
every source file that contains Numerical Recipes routines. Since, more frequently 
than not, you will want to include more than one Numerical Recipes routine in a 
single source file, we have not printed this #include statement in front of this 
book’s individual program listings, but you should make sure that it is present in 
your programs. 

As backup, and in accordance with the last item on the indented list above, 
we declare the function prototype of all Numerical Recipes routines that are called 
by other Numerical Recipes routines internally to the calling routine. (That also 
makes our routines much more readable.) The only exception to this rule is that 
the small number of utility routines that we use repeatedly (described below) are 
declared in the additional header file nrutil.h, and the line #include "nrutil.h" 
is explicitly printed whenever it is needed. 

A final important point about the header file nr.h is that, as furnished on 
the diskette, it contains both ANSI C and traditional K&R-style declarations. The 
ANSI forms are invoked if any of the following macros are defined: __STDC__, 
ANSI, or NRANSI. (The purpose of the last name is to give you an invocation that 
does not conflict with other possible uses of the first two names.) If you have an 
ANSI compiler, it is essential that you invoke it with one or more of these macros 
defined. The typical means for doing so is to include a switch like “-DANSI” on 
the compiler command line. 
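
A header that serves both kinds of compiler might be organized roughly as in the sketch below. This is an illustration of the macro test only, not an excerpt from nr.h; the function half is hypothetical, and the definition is placed inside the conditional only so that the fragment compiles on its own.

```c
#if defined(__STDC__) || defined(ANSI) || defined(NRANSI)

/* ANSI form: the prototype lets the compiler check argument types. */
float half(float x);

float half(float x)
{
    return 0.5f * x;
}

#else

/* Traditional K&R form: no parameter types in the declaration. */
float half();

float half(x)
float x;
{
    return 0.5 * x;
}

#endif
```

A conforming ANSI compiler defines __STDC__ itself, so the first branch is the one normally seen; the switch -DANSI or -DNRANSI forces the same branch on a compiler that accepts prototypes but does not define __STDC__.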

Some further details about the file nr.h are given in Appendix A. 

Vectors and One-Dimensional Arrays 

There is a close, and elegant, correspondence in C between pointers and arrays. 
The value referenced by an expression like a[j] is defined to be *((a) + (j)), 
that is, “the contents of the address obtained by incrementing the pointer a by 
j.” A consequence of this definition is that if a points to a legal data location, 
the array element a[0] is always defined. Arrays in C are natively “zero-origin” 
or “zero-offset.” An array declared by the statement float b[4]; has the valid 
references b[0], b[1], b[2], and b[3], but not b[4]. 
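
The equivalence between subscripting and pointer arithmetic can be checked directly. The function below is a hypothetical illustration; it compares both the addresses and the contents obtained in the two ways for every valid index of b[0..3].

```c
/* Returns 1 if subscripting and explicit pointer arithmetic agree for
   every valid index of b[0..3], and 0 otherwise. */
int subscript_matches_pointer(void)
{
    float b[4] = {10.0f, 11.0f, 12.0f, 13.0f};
    int j;
    for (j = 0; j <= 3; j++) {
        if (&b[j] != b + j) return 0;       /* same address */
        if (b[j] != *(b + j)) return 0;     /* same contents */
    }
    return 1;
}
```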

Right away we need a notation to indicate what is the valid range of an array 
index. (The issue comes up about a thousand times in this book!) For the above 
example, the index range of b will be henceforth denoted b[0..3], a notation 
borrowed from Pascal. In general, the range of an array declared by float 
a[M]; is a[0..M-1], and the same if float is replaced by any other data type. 

One problem is that many algorithms naturally like to go from 1 to M, not 
from 0 to M-1. Sure, you can always convert them, but they then often acquire 
a baggage of additional arithmetic in array indices that is, at best, distracting. It is 
better to use the power of the C language, in a consistent way, to make the problem 
disappear. Consider 

float b[4],*bb;
bb=b-1;


The pointer bb now points one location before b. An immediate consequence is that 
the array elements bb[1], bb[2], bb[3], and bb[4] all exist. In other words the 
range of bb is bb[1..4]. We will refer to bb as a unit-offset vector. (See Appendix 
B for some additional discussion of technical details.) 

It is sometimes convenient to use zero-offset vectors, and sometimes convenient 
to use unit-offset vectors in algorithms. The choice should be whichever is most 
natural to the problem at hand. For example, the coefficients of a polynomial 
$a_0 + a_1 x + a_2 x^2 + \cdots + a_n x^n$ clearly cry out for the zero-offset a[0..n], while 
a vector of N data points $x_i$, $i = 1 \ldots N$ calls for a unit-offset x[1..N]. When a 
routine in this book has an array as an argument, its header comment always gives 
the expected index range. For example, 

void someroutine(float bb[], int nn)
/* This routine does something with the vector bb[1..nn]. */



Now, suppose you want someroutine() to do its thing on your own vector, 
of length 7, say. If your vector, call it aa, is already unit-offset (has the valid range 
aa[1..7]), then you can invoke someroutine(aa,7); in the obvious way. That is 
the recommended procedure, since someroutine() presumably has some logical, 
or at least aesthetic, reason for wanting a unit-offset vector. 

But suppose that your vector of length 7, now call it a, is perversely a native C, 
zero-offset array (has range a[0..6]). Perhaps this is the case because you disagree 
with our aesthetic prejudices, Heaven help you! To use our recipe, do you have to 
copy a’s contents element by element into another, unit-offset vector? No! Do you 
have to declare a new pointer aaa and set it equal to a-1? No! You simply invoke 
someroutine(a-1,7);. Then a[1], as seen from within our recipe, is actually 
a[0] as seen from your program. In other words, you can change conventions “on 
the fly” with just a couple of keystrokes. 

Forgive us for belaboring these points. We want to free you from the zero-offset 
thinking that C encourages but (as we see) does not require. A final liberating point 
is that the utility file nrutil.c, listed in full in Appendix B, includes functions 
for allocating (using malloc()) arbitrary-offset vectors of arbitrary lengths. The 
synopses of these functions are as follows: 

float *vector(long nl, long nh)
Allocates a float vector with range [nl..nh].

int *ivector(long nl, long nh)
Allocates an int vector with range [nl..nh].

unsigned char *cvector(long nl, long nh)
Allocates an unsigned char vector with range [nl..nh].

unsigned long *lvector(long nl, long nh)
Allocates an unsigned long vector with range [nl..nh].

double *dvector(long nl, long nh)
Allocates a double vector with range [nl..nh].

A typical use of the above utilities is the declaration float *b; followed by 
b=vector(1,7);, which makes the range b[1..7] come into existence and allows 
b to be passed to any function calling for a unit-offset vector. 
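
To give a feeling for what such an allocator has to do, here is a minimal sketch restricted to the unit-offset case. It is not the nrutil.c code: the names unit_vector, free_unit_vector, and sum_vector are hypothetical, and the sketch takes the simple route of allocating one extra, unused element so that the pointer returned by malloc can itself serve as the unit-offset vector. It assumes n is nonnegative.

```c
#include <stdlib.h>

/* Hypothetical helper, not the nrutil.c routine: allocates storage for a
   unit-offset float vector v[1..n]. One extra element is allocated, so
   v[0] exists but is unused. Returns NULL if the allocation fails. */
float *unit_vector(long n)
{
    return (float *) malloc((size_t) (n + 1) * sizeof(float));
}

/* Releases a vector obtained from unit_vector(). */
void free_unit_vector(float *v)
{
    free(v);
}

/* Example routine in the header-comment convention of the text:
   returns the sum of the vector bb[1..nn]. */
float sum_vector(float bb[], int nn)
{
    int j;
    float s = 0.0f;
    for (j = 1; j <= nn; j++) s += bb[j];
    return s;
}
```

A call such as b=unit_vector(7); then plays the role of b=vector(1,7); in the text, at the cost of one wasted element.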

The file nrutil.c also contains the corresponding deallocation routines, 

void free_vector(float *v, long nl, long nh) 
void free_ivector(int *v, long nl, long nh) 
void free_cvector(unsigned char *v, long nl, long nh) 
void free_lvector(unsigned long *v, long nl, long nh) 
void free_dvector(double *v, long nl, long nh) 

with the typical use being free_vector(b,1,7);. 

Our recipes use the above utilities extensively for the allocation and deallocation 
of vector workspace. We also commend them to you for use in your main programs or 
other procedures. Note that if you want to allocate vectors of length longer than 64k 
on an IBM PC-compatible computer, you should replace all occurrences of malloc 
in nrutil.c by your compiler’s special-purpose memory allocation function. This 
applies also to matrix allocation, to be discussed next. 

Matrices and Two-Dimensional Arrays 

The zero- versus unit-offset issue arises here, too. Let us, however, defer it for 
a moment in favor of an even more fundamental matter, that of variable dimension 
arrays (FORTRAN terminology) or conformant arrays (Pascal terminology). These 
are arrays that need to be passed to a function along with real-time information 
about their two-dimensional size. The systems programmer rarely deals with two- 
dimensional arrays, and almost never deals with two-dimensional arrays whose size 
is variable and known only at run time. Such arrays are, however, the bread and 
butter of scientific computing. Imagine trying to live with a matrix inversion routine 
that could work with only one size of matrix! 

There is no technical reason that a C compiler could not allow a syntax like 

void someroutine(a,m,n) 

float a[m][n]; /* ILLEGAL DECLARATION */ 

and emit code to evaluate the variable dimensions m and n (or any variable-dimension 
expression) each time someroutine() is entered. Alas! the above fragment is 
forbidden by the C language definition. The implementation of variable dimensions 
in C instead requires some additional finesse; however, we will see that one is 
rewarded for the effort. 

There is a subtle near-ambiguity in the C syntax for two-dimensional array 
references. Let us elucidate it, and then turn it to our advantage. Consider the 
array reference to a (say) float value a[i][j], where i and j are expressions 
that evaluate to type int. A C compiler will emit quite different machine code for 
this reference, depending on how the identifier a has been declared. If a has been 
declared as a fixed-size array, e.g., float a[5][9];, then the machine code is: “to 
the address a add 9 times i, then add j, return the value thus addressed.” Notice that 
the constant 9 needs to be known in order to effect the calculation, and an integer 
multiplication is required (see Figure 1.2.1). 

Suppose, on the other hand, that a has been declared by float **a;. Then 
the machine code for a[i][j] is: “to the address of a add i, take the value thus 
addressed as a new address, add j to it, return the value addressed by this new 
address.” Notice that the underlying size of a[][] does not enter this calculation 
at all, and that there is no multiplication; an additional indirection replaces it. We 
thus have, in general, a faster and more versatile scheme than the previous one. The 
price that we pay is the storage requirement for one array of pointers (to the rows 
of a[][]), and the slight inconvenience of remembering to initialize those pointers 
when we declare an array. 

Here is our bottom line: We avoid the fixed-size two-dimensional arrays of C as 
being unsuitable data structures for representing matrices in scientific computing. We 
adopt instead the convention “pointer to array of pointers,” with the array elements 
pointing to the first element in the rows of each matrix. Figure 1.2.1 contrasts the 
rejected and adopted schemes. 

Figure 1.2.1. ... lines connect sequential memory locations. (a) Pointer to a fixed-size two-dimensional 
array. (b) Pointer to an array of pointers to rows; this is the scheme adopted in this book. 

The following fragment shows how a fixed-size array a of size 13 by 9 is 
converted to a “pointer to array of pointers” reference aa: 

float a[13][9],**aa;
int i;
aa=(float **) malloc((unsigned) 13*sizeof(float*));
for(i=0;i<=12;i++) aa[i]=a[i];      /* a[i] is a pointer to a[i][0] */

The identifier aa is now a matrix with index range aa[0..12][0..8]. You can use 
or modify its elements ad lib, and more importantly you can pass it as an argument 
to any function by its name aa. That function, which declares the corresponding 
dummy argument as float **aa;, can address its elements as aa[i][j] without 
knowing its physical size. 
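
Here is a small, self-contained sketch of that idea. The receiving function sum_matrix and the driver demo_row_pointers are hypothetical names; for brevity the sketch uses a 3 by 4 array and keeps the row pointers in an ordinary array rather than obtaining them from malloc.

```c
/* Sums all elements of the matrix m[0..nrow-1][0..ncol-1]. The function
   sees only row pointers; the declared width of the caller's array never
   appears in its code. */
float sum_matrix(float **m, int nrow, int ncol)
{
    int i, j;
    float s = 0.0f;
    for (i = 0; i < nrow; i++)
        for (j = 0; j < ncol; j++) s += m[i][j];
    return s;
}

/* Builds row pointers for a fixed-size array and passes them along. */
float demo_row_pointers(void)
{
    float a[3][4] = {{1, 2, 3, 4}, {5, 6, 7, 8}, {9, 10, 11, 12}};
    float *aa[3];                           /* array of pointers to rows */
    int i;
    for (i = 0; i <= 2; i++) aa[i] = a[i];  /* a[i] is a pointer to a[i][0] */
    aa[1][2] = 70.0f;                       /* same storage as a[1][2] */
    return sum_matrix(aa, 3, 4);
}
```

The assignment through aa changes a itself, since the row pointers refer to the original storage and no elements are copied.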

You may rightly not wish to clutter your programs with code like the above 
fragment. Also, there is still the outstanding problem of how to treat unit-offset 
indices, so that (for example) the above matrix aa could be addressed with the range 
a[1..13][1..9]. Both of these problems are solved by additional utility routines 
in nrutil.c (Appendix B) which allocate and deallocate matrices of arbitrary 
range. The synopses are 

float **matrix(long nrl, long nrh, long ncl, long nch)
Allocates a float matrix with range [nrl..nrh][ncl..nch].

double **dmatrix(long nrl, long nrh, long ncl, long nch)
Allocates a double matrix with range [nrl..nrh][ncl..nch].

int **imatrix(long nrl, long nrh, long ncl, long nch)
Allocates an int matrix with range [nrl..nrh][ncl..nch].

void free_matrix(float **m, long nrl, long nrh, long ncl, long nch)
Frees a matrix allocated with matrix.

void free_dmatrix(double **m, long nrl, long nrh, long ncl, long nch)
Frees a matrix allocated with dmatrix.

void free_imatrix(int **m, long nrl, long nrh, long ncl, long nch)
Frees a matrix allocated with imatrix.


A typical use is 


float **a;
a=matrix(1,13,1,9);

a[3][5]=...
...+a[2][9]/3.0...
someroutine(a,...);

free_matrix(a,1,13,1,9);


All matrices in Numerical Recipes are handled with the above paradigm, and we 
commend it to you. 
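
For readers who want to see the mechanism, the sketch below allocates a matrix in the pointer-to-array-of-pointers form. It is a simplified, zero-offset illustration and not the nrutil.c routine: the names alloc_matrix0 and free_matrix0 are hypothetical, and the choice of one contiguous data block is an assumption of this sketch. It assumes nrow and ncol are at least 1.

```c
#include <stdlib.h>

/* Hypothetical zero-offset allocator, not the nrutil.c routine: returns a
   matrix m[0..nrow-1][0..ncol-1] stored as an array of row pointers into
   one contiguous block of floats, or NULL if either allocation fails. */
float **alloc_matrix0(long nrow, long ncol)
{
    long i;
    float **m = (float **) malloc((size_t) nrow * sizeof(float *));
    if (m == NULL) return NULL;
    m[0] = (float *) malloc((size_t) (nrow * ncol) * sizeof(float));
    if (m[0] == NULL) {
        free(m);
        return NULL;
    }
    for (i = 1; i < nrow; i++) m[i] = m[i-1] + ncol;    /* set the row pointers */
    return m;
}

/* Frees a matrix obtained from alloc_matrix0(): first the data block,
   then the array of row pointers. */
void free_matrix0(float **m)
{
    free(m[0]);
    free(m);
}
```

After a=alloc_matrix0(13,9); the elements a[0][0] through a[12][8] can be read and written exactly as in the typical use shown above, apart from the index origin.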

Some further utilities for handling matrices are also included in nrutil.c. 
The first is a function submatrix() that sets up a new pointer reference to an 
already-existing matrix (or sub-block thereof), along with new offsets if desired. 
Its synopsis is 



float **submatrix(float **a, long oldrl, long oldrh, long oldcl,
    long oldch, long newrl, long newcl)
Point a submatrix [newrl..newrl+(oldrh-oldrl)][newcl..newcl+(oldch-oldcl)] to
the existing matrix range a[oldrl..oldrh][oldcl..oldch].


Here oldrl and oldrh are respectively the lower and upper row indices of the 
original matrix that are to be represented by the new matrix, oldcl and oldch are 
the corresponding column indices, and newrl and newcl are the lower row and 
column indices for the new matrix. (We don’t need upper row and column indices, 
since they are implied by the quantities already given.) 

Two sample uses might be, first, to select as a 2 x 2 submatrix b[1..2][1..2] 
some interior range of an existing matrix, say a[4..5][2..3], 

float **a,**b;
a=matrix(1,13,1,9);

b=submatrix(a,4,5,2,3,1,1);

and second, to map an existing matrix a[1..13][1..9] into a new matrix 
b[0..12][0..8], 

float **a,**b;
a=matrix(1,13,1,9);

b=submatrix(a,1,13,1,9,0,0);

Incidentally, you can use submatrix() for matrices of any type whose sizeof() 
is the same as sizeof(float) (often true for int, e.g.); just cast the first argument 
to type float ** and cast the result to the desired type, e.g., int **. 

The function 


void free_submatrix(float **b, long nrl, long nrh, long ncl, long nch) 

frees the array of row-pointers allocated by submatrix(). Note that it does not free 
the memory allocated to the data in the submatrix, since that space still lies within 
the memory allocation of some original matrix. 
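
The reason is easy to see in a sketch. The function below is a simplified, hypothetical counterpart of submatrix(), with the new lower indices fixed at zero; it is not the nrutil.c code, and the names view_submatrix0 and free_view_submatrix0 are ours. It allocates row pointers only, each aimed at the interior of an existing row.

```c
#include <stdlib.h>

/* Hypothetical simplified counterpart of submatrix(), with new lower
   indices fixed at zero: returns b such that b[i][j] refers to the same
   storage as a[oldrl+i][oldcl+j], for i in 0..oldrh-oldrl. Only the row
   pointers are allocated. Returns NULL if the allocation fails. */
float **view_submatrix0(float **a, long oldrl, long oldrh, long oldcl)
{
    long i, nrow = oldrh - oldrl + 1;
    float **b = (float **) malloc((size_t) nrow * sizeof(float *));
    if (b == NULL) return NULL;
    for (i = 0; i < nrow; i++) b[i] = a[oldrl + i] + oldcl;
    return b;
}

/* Frees only the row pointers; the data still belong to the original matrix. */
void free_view_submatrix0(float **b)
{
    free(b);
}
```

An assignment through b therefore changes the original matrix, and freeing b leaves the original matrix fully usable.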

Finally, if you have a standard C matrix declared as a[nrow][ncol], and you 
want to convert it into a matrix declared in our pointer-to-row-of-pointers manner, 
the following function does the trick: 


float **convert_matrix(float *a, long nrl, long nrh, long ncl, long nch)
Allocate a float matrix m[nrl..nrh][ncl..nch] that points to the matrix declared in the
standard C manner as a[nrow][ncol], where nrow=nrh-nrl+1 and ncol=nch-ncl+1. The
routine should be called with the address &a[0][0] as the first argument.

(You can use this function when you want to make use of C’s initializer syntax 
to set values for a matrix, but then be able to pass the matrix to programs in this 
book.) The function 


void free_convert_matrix(float **b, long nrl, long nrh, long ncl, long nch) 
Free a matrix allocated by convert_matrix(). 


frees the allocation, without affecting the original matrix a. 
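
The conversion just described can be sketched as follows. This is a hedged, zero-origin analogue only: rows_over and sum_all are hypothetical names rather than the book's convert_matrix(), and the sketch assumes the rows of the standard C matrix are stored contiguously, one after another.

```c
#include <stdlib.h>

/* Hypothetical zero-origin analogue of convert_matrix(): make row
   pointers m[0..nrow-1] into a block of nrow*ncol contiguous floats. */
float **rows_over(float *a, long nrow, long ncol)
{
    long i;
    float **m = malloc((size_t)nrow * sizeof *m);
    if (m == NULL) return NULL;
    for (i = 0; i < nrow; i++)
        m[i] = a + i * ncol;           /* start of row i in the block */
    return m;
}

/* A routine written for the pointer-to-rows form: sum every element. */
float sum_all(float **m, long nrow, long ncol)
{
    long i, j;
    float s = 0.0f;
    for (i = 0; i < nrow; i++)
        for (j = 0; j < ncol; j++)
            s += m[i][j];
    return s;
}
```

A matrix set up with an initializer, say float a[2][3]={{1,2,3},{4,5,6}};, could then be handed to sum_all as rows_over(&a[0][0],2,3); releasing the row pointers with free() leaves a itself untouched.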

The only examples of allocating a three-dimensional array as a pointer-to-
pointer-to-pointer structure in this book are found in the routines rlft3 in §12.5 and
sfroid in §17.4. The necessary allocation and deallocation functions are


float ***f3tensor(long nrl, long nrh, long ncl, long nch, long ndl, long ndh) 
Allocate a float 3-dimensional array with subscript range [nrl..nrh][ncl..nch][ndl..ndh].



void free_f3tensor(float ***t, long nrl, long nrh, long ncl, long nch, 
long ndl, long ndh) 

Free a float 3-dimensional array allocated by f3tensor(). 



Complex Arithmetic 


C does not have complex data types, or predefined arithmetic operations on 
complex numbers. That omission is easily remedied with the set of functions in 
the file complex.c, which is printed in full in Appendix C at the back of the book.
A synopsis is as follows:

typedef struct FCOMPLEX {float r,i;} fcomplex;

fcomplex Cadd(fcomplex a, fcomplex b) 

Returns the complex sum of two complex numbers. 

fcomplex Csub(fcomplex a, fcomplex b) 

Returns the complex difference of two complex numbers. 

fcomplex Cmul(fcomplex a, fcomplex b) 

Returns the complex product of two complex numbers. 

fcomplex Cdiv(fcomplex a, fcomplex b) 

Returns the complex quotient of two complex numbers. 

fcomplex Csqrt(fcomplex z)

Returns the complex square root of a complex number.

fcomplex Conjg(fcomplex z)

Returns the complex conjugate of a complex number.

float Cabs(fcomplex z)

Returns the absolute value (modulus) of a complex number.

fcomplex Complex(float re, float im)

Returns a complex number with specified real and imaginary parts.

fcomplex RCmul(float x, fcomplex a)

Returns the complex product of a real number and a complex number.


The implementation of several of these complex operations in floating-point 
arithmetic is not entirely trivial; see §5.4. 
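
As an illustration of what such functions can look like, here is a small sketch of a product and a quotient passed and returned by value. The type cx and the names cx_mul and cx_div are stand-ins chosen for this example, not the listings in complex.c. The quotient divides through by the larger component of the denominator first, which is one common way of avoiding needless overflow; treat it as an assumption about a reasonable approach rather than a statement of the book's code.

```c
/* Illustrative stand-ins only; the book's own versions are in complex.c. */
typedef struct { float r, i; } cx;

/* Local absolute value, so that no math library is needed. */
static float cx_absf(float x) { return x < 0.0f ? -x : x; }

/* Complex product (a.r + i a.i)(b.r + i b.i). */
cx cx_mul(cx a, cx b)
{
    cx c;
    c.r = a.r * b.r - a.i * b.i;
    c.i = a.i * b.r + a.r * b.i;
    return c;
}

/* Complex quotient a/b.  The numerator and denominator are first divided
   by the larger of |b.r| and |b.i|, so b.r*b.r + b.i*b.i is never formed. */
cx cx_div(cx a, cx b)
{
    cx c;
    float r, den;
    if (cx_absf(b.r) >= cx_absf(b.i)) {
        r = b.i / b.r;
        den = b.r + r * b.i;
        c.r = (a.r + r * a.i) / den;
        c.i = (a.i - r * a.r) / den;
    } else {
        r = b.r / b.i;
        den = b.i + r * b.r;
        c.r = (a.r * r + a.i) / den;
        c.i = (a.i * r - a.r) / den;
    }
    return c;
}
```

For example, multiplying 1+2i by 3+4i gives -5+10i, and dividing that result by 3+4i recovers 1+2i.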

Only about half a dozen routines in this book make explicit use of these complex 
arithmetic functions. The resulting code is not as readable as one would like, because 
the familiar operations +-*/ are replaced by function calls. The C++ extension to 
the C language allows operators to be redefined. That would allow more readable 
code. However, in this book we are committed to standard C. 

We should mention that the above functions assume the ability to pass, return, 
and assign structures like FCOMPLEX (or types such as fcomplex that are defined 
to be structures) by value. All recent C compilers have this ability, but it is not in 
the original K&R C definition. If you are missing it, you will have to rewrite the 
functions in complex.c, making them pass and return pointers to variables of type
fcomplex instead of the variables themselves. Likewise, you will need to modify 
the recipes that use the functions. 

Several other routines (e.g., the Fourier transforms four1 and fourn) do
complex arithmetic “by hand,” that is, they carry around real and imaginary parts as
float variables. This results in more efficient code than would be obtained by using
the functions in complex.c. But the code is even less readable. There is simply no
ideal solution to the complex arithmetic problem in C.

Implicit Conversion of Float to Double 



In traditional (K&R) C, float variables are automatically converted to double 
before any operation is attempted, including both arithmetic operations and passing 
as arguments to functions. All arithmetic is then done in double precision. If a 
float variable receives the result of such an arithmetic operation, the high precision 
is immediately thrown away. A corollary of these rules is that all the real-number
standard C library functions are of type double and compute to double precision. 

The justification for these conversion rules is, “well, there’s nothing wrong with 
a little extra precision,” and “this way the libraries need only one version of each 
function.” One does not need much experience in scientific computing to recognize 
that the implicit conversion rules are, in fact, sheer madness! In effect, they make it 
impossible to write efficient numerical programs. One of the cultural barriers that 
separates computer scientists from “regular” scientists and engineers is a differing 
point of view on whether a 30% or 50% loss of speed is worth worrying about. In 
many real-time or state-of-the-art scientific applications, such a loss is catastrophic. 
The practical scientist is trying to solve tomorrow’s problem with yesterday’s 
computer; the computer scientist, we think, often has it the other way around. 

The ANSI C standard happily does not allow implicit conversion for arithmetic 
operations, but it does require it for function arguments, unless the function is fully 
prototyped by an ANSI declaration as described earlier in this section. That is 
another reason for our being rigorous about using the ANSI prototype mechanism, 
and a good reason for you to use an ANSI-compatible compiler. 

Some older C compilers do provide an optional compilation mode in which 
the implicit conversion of float to double is suppressed. Use this if you can. 
In this book, when we write float, we mean float; when we write double, 
we mean double, i.e., there is a good algorithmic reason for having higher 
precision. Our routines all can tolerate the traditional implicit conversion rules, 
but they are more efficient without them. Of course, if your application actually 
requires double precision, you can change our declarations from float to double 
without difficulty. (The brute force approach is to add a preprocessor statement 
#define float double !) 

A Few Wrinkles 



We like to keep code compact, avoiding unnecessary spaces unless they add 
immediate clarity. We usually don’t put space around the assignment operator “=”. 
Through a quirk of history, however, some C compilers recognize the (nonexistent)
operator “=-” as being equivalent to the subtractive assignment operator “-=”, and
“=*” as being the same as the multiplicative assignment operator “*=”. That is why
you will see us write y= -10.0; or y=(-10.0);, and y= *a; or y=(*a);.

We have the same viewpoint regarding unnecessary parentheses. You can’t write 
(or read) C effectively unless you memorize its operator precedence and associativity 
rules. Please study the accompanying table while you brush your teeth every night. 

We never use the register storage class specifier. Good optimizing compilers 
are quite sophisticated in making their own decisions about what to keep in registers, 
and the best choices are sometimes rather counter-intuitive. 

Different compilers use different methods of distinguishing between defining 
and referencing declarations of the same external name in several files. We follow 
the most common scheme, which is also the ANSI standard. The storage class 
extern is explicitly included on all referencing top-level declarations. The storage 
class is omitted from the single defining declaration for each external variable. We 
have commented these declarations, so that if your compiler uses a different scheme 
you can change the code. The various schemes are discussed in §4.8 of [1].


Operator Precedence and Associativity Rules in C
(groups listed from highest to lowest precedence)

Operator    Meaning                               Associativity
--------    ------------------------------------  -------------
()          function call                         left-to-right
[]          array element
.           structure or union member
->          pointer reference to structure

!           logical not                           right-to-left
~           bitwise complement
-           unary minus
++          increment
--          decrement
&           address of
*           contents of
(type)      cast to type
sizeof      size in bytes

*           multiply                              left-to-right
/           divide
%           remainder

+           add                                   left-to-right
-           subtract

<<          bitwise left shift                    left-to-right
>>          bitwise right shift

<           arithmetic less than                  left-to-right
>           arithmetic greater than
<=          arithmetic less than or equal to
>=          arithmetic greater than or equal to

==          arithmetic equal                      left-to-right
!=          arithmetic not equal

&           bitwise and                           left-to-right

^           bitwise exclusive or                  left-to-right

|           bitwise or                            left-to-right

&&          logical and                           left-to-right

||          logical or                            left-to-right

?:          conditional expression                right-to-left

=           assignment operator                   right-to-left
            also += -= *= /= %=
                 <<= >>= &= ^= |=

,           sequential expression                 left-to-right


We have already alluded to the problem of computing small integer powers of 
numbers, most notably the square and cube. The omission of this operation from C 
is perhaps the language’s most galling insult to the scientific programmer. All good 
FORTRAN compilers recognize expressions like (A+B)**4 and produce in-line code,
in this case with only one add and two multiplies. It is typical for constant integer 
powers up to 12 to be thus recognized. 





Sample page from NUMERICAL RECIPES IN C: THE ART OF SCIENTIFIC COMPUTING (ISBN 0-521-43108-5)






In C, the mere problem of squaring is hard enough! Some people “macro-ize” 
the operation as 

#define SQR(a) ((a)*(a)) 


However, this is likely to produce code where SQR(sin(x)) results in two calls to 
the sine routine! You might be tempted to avoid this by storing the argument of the 
squaring function in a temporary variable: 

static float sqrarg; 

#define SQR(a) (sqrarg=(a),sqrarg*sqrarg) 


The global variable sqrarg now has (and needs to keep) scope over the whole 
module, which is a little dangerous. Also, one needs a completely different macro to 
square expressions of type int. More seriously, this macro can fail if there are two 
SQR operations in a single expression. Since in C the order of evaluation of pieces of 
the expression is at the compiler’s discretion, the value of sqrarg in one evaluation 
of SQR can be that from the other evaluation in the same expression, producing 
nonsensical results. When we need a guaranteed-correct SQR macro, we use the 
following, which exploits the guaranteed complete evaluation of subexpressions in 
a conditional expression: 

static float sqrarg; 

#define SQR(a) ((sqrarg=(a)) == 0.0 ? 0.0 : sqrarg*sqrarg) 
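
The difference between the simple macro and the conditional-expression version can be checked with a small experiment. In this sketch the names SQR_NAIVE, SQR_SAFE, counted, calls_naive and calls_safe are invented for the illustration; counted() merely stands in for an expensive routine such as sin() and records how often it is called.

```c
static int ncalls = 0;               /* how many times counted() has run */
static float sqrarg;                 /* temporary used by SQR_SAFE */

#define SQR_NAIVE(a) ((a)*(a))
#define SQR_SAFE(a)  ((sqrarg=(a)) == 0.0 ? 0.0 : sqrarg*sqrarg)

/* Stand-in for an expensive function; it counts its own calls. */
static float counted(float x) { ncalls++; return 2.0f * x; }

/* Number of calls made while evaluating SQR_NAIVE(counted(3)). */
int calls_naive(void)
{
    float y;
    ncalls = 0;
    y = SQR_NAIVE(counted(3.0f));    /* argument text appears twice */
    (void)y;
    return ncalls;
}

/* Number of calls made while evaluating SQR_SAFE(counted(3)). */
int calls_safe(void)
{
    float y;
    ncalls = 0;
    y = SQR_SAFE(counted(3.0f));     /* argument evaluated once, then stored */
    (void)y;
    return ncalls;
}
```

Here calls_naive() reports two evaluations of the argument and calls_safe() reports one, which is the saving the temporary variable is meant to buy.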


A collection of macros for other simple operations is included in the file nrutil.h
(see Appendix B) and used by many of our programs. Here are the synopses:


SQR(a)        Square a float value.
DSQR(a)       Square a double value.
FMAX(a,b)     Maximum of two float values.
FMIN(a,b)     Minimum of two float values.
DMAX(a,b)     Maximum of two double values.
DMIN(a,b)     Minimum of two double values.
IMAX(a,b)     Maximum of two int values.
IMIN(a,b)     Minimum of two int values.
LMAX(a,b)     Maximum of two long values.
LMIN(a,b)     Minimum of two long values.
SIGN(a,b)     Magnitude of a times sign of b.


Scientific programming in C may someday become a bed of roses; for now, 
watch out for the thorns! 


CITED REFERENCES AND FURTHER READING: 

Harbison, S.P., and Steele, G.L., Jr. 1991, C: A Reference Manual, 3rd ed. (Englewood Cliffs,
NJ: Prentice-Hall). [1]

AT&T Bell Laboratories 1985, The C Programmer’s Handbook (Englewood Cliffs, NJ: Prentice-Hall).

Kernighan, B., and Ritchie, D. 1978, The C Programming Language (Englewood Cliffs, NJ:
Prentice-Hall). [Reference for K&R “traditional” C. Later editions of this book conform to
the ANSI C standard.]

Hogan, T. 1984, The C Programmer’s Handbook (Bowie, MD: Brady Communications).


1.3 Error, Accuracy, and Stability

Although we assume no prior training of the reader in formal numerical analysis, 
we will need to presume a common understanding of a few key concepts. We will 
define these briefly in this section. 

Computers store numbers not with infinite precision but rather in some approximation
that can be packed into a fixed number of bits (binary digits) or bytes (groups
of 8 bits). Almost all computers allow the programmer a choice among several
different such representations or data types. Data types can differ in the number of
bits utilized (the wordlength), but also in the more fundamental respect of whether
the stored number is represented in fixed-point (int or long) or floating-point
(float or double) format.

A number in integer representation is exact. Arithmetic between numbers in 
integer representation is also exact, with the provisos that (i) the answer is not outside 
the range of (usually, signed) integers that can be represented, and (ii) that division 
is interpreted as producing an integer result, throwing away any integer remainder. 

In floating-point representation, a number is represented internally by a sign bit 
s (interpreted as plus or minus), an exact integer exponent e, and an exact positive 
integer mantissa M. Taken together these represent the number 

$$ s \times M \times B^{e-E} \tag{1.3.1} $$

where B is the base of the representation (usually B = 2, but sometimes B = 16), 
and E is the bias of the exponent, a fixed integer constant for any given machine 
and representation. An example is shown in Figure 1.3.1. 

Several floating-point bit patterns can represent the same number. If B = 2, 
for example, a mantissa with leading (high-order) zero bits can be left-shifted, i.e., 
multiplied by a power of 2, if the exponent is decreased by a compensating amount. 
Bit patterns that are “as left-shifted as they can be” are termed normalized. Most 
computers always produce normalized results, since these don’t waste any bits of 
the mantissa and thus allow a greater accuracy of the representation. Since the 
high-order bit of a properly normalized mantissa (when B = 2) is always one, some 
computers don’t store this bit at all, giving one extra bit of significance. 

Arithmetic among numbers in floating-point representation is not exact, even if 
the operands happen to be exactly represented (i.e., have exact values in the form of 
equation 1.3.1). For example, two floating numbers are added by first right-shifting 
(dividing by two) the mantissa of the smaller (in magnitude) one, simultaneously 
increasing its exponent, until the two operands have the same exponent. Low-order 
(least significant) bits of the smaller operand are lost by this shifting. If the two 
operands differ too greatly in magnitude, then the smaller operand is effectively 
replaced by zero, since it is right-shifted to oblivion. 

Figure 1.3.1. Floating point representations of numbers in a typical 32-bit (4-byte) format. (a) The
number 1/2 (note the bias in the exponent); (b) the number 3; (c) the number 1/4; (d) the number
$10^{-7}$, represented to machine accuracy; (e) the same number $10^{-7}$, but shifted so as to have the same
exponent as the number 3; with this shifting, all significance is lost and $10^{-7}$ becomes zero; shifting to
a common exponent must occur before two numbers can be added; (f) sum of the numbers $3 + 10^{-7}$,
which equals 3 to machine accuracy. Even though $10^{-7}$ can be represented accurately by itself, it cannot
accurately be added to a much larger number.

The smallest (in magnitude) floating-point number which, when added to the
floating-point number 1.0, produces a floating-point result different from 1.0 is
termed the machine accuracy $\epsilon_m$. A typical computer with B = 2 and a 32-bit
wordlength has $\epsilon_m$ around $3 \times 10^{-8}$. (A more detailed discussion of machine
characteristics, and a program to determine them, is given in §20.1.) Roughly
speaking, the machine accuracy $\epsilon_m$ is the fractional accuracy to which floating-point
numbers are represented, corresponding to a change of one in the least significant
bit of the mantissa. Pretty much any arithmetic operation among floating numbers
should be thought of as introducing an additional fractional error of at least $\epsilon_m$. This
type of error is called roundoff error.
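
A rough way to see the machine accuracy on your own computer is to keep halving a trial value until adding half of it to 1.0 no longer changes the sum. The sketch below does this for float; machine_eps is a name invented here and is not the program of §20.1. On hardware with IEEE single precision and round-to-nearest, this loop is expected to return $2^{-23} \approx 1.2 \times 10^{-7}$, the same value as FLT_EPSILON in <float.h>. Since it finds only a power of two, and since formats and rounding rules differ between machines, the result need not coincide with the figure quoted above.

```c
#include <float.h>

/* Smallest power of two, eps, such that 1.0f + eps/2 is no longer
   distinguishable from 1.0f in single precision. */
float machine_eps(void)
{
    volatile float sum;          /* volatile: force the sum to be rounded to float */
    float eps = 1.0f;
    for (;;) {
        sum = 1.0f + eps / 2.0f;
        if (sum == 1.0f) break;  /* eps/2 was shifted to oblivion */
        eps /= 2.0f;
    }
    return eps;
}
```

The volatile qualifier is there because some compilers keep intermediate results in registers that are wider than a float, which would otherwise make the loop run too long.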

It is important to understand that $\epsilon_m$ is not the smallest floating-point number
that can be represented on a machine. That number depends on how many bits there
are in the exponent, while $\epsilon_m$ depends on how many bits there are in the mantissa.

Roundoff errors accumulate with increasing amounts of calculation. If, in the
course of obtaining a calculated value, you perform N such arithmetic operations,
you might be so lucky as to have a total roundoff error on the order of $\sqrt{N}\,\epsilon_m$, if
the roundoff errors come in randomly up or down. (The square root comes from a
random-walk.) However, this estimate can be very badly off the mark for two reasons:

(i) It very frequently happens that the regularities of your calculation, or the
peculiarities of your computer, cause the roundoff errors to accumulate preferentially
in one direction. In this case the total will be of order $N\epsilon_m$.

(ii) Some especially unfavorable occurrences can vastly increase the roundoff 
error of single operations. Generally these can be traced to the subtraction of two 
very nearly equal numbers, giving a result whose only significant bits are those 
(few) low-order ones in which the operands differed. You might think that such a 
“coincidental” subtraction is unlikely to occur. Not always so. Some mathematical 
expressions magnify its probability of occurrence tremendously. For example, in the 
familiar formula for the solution of a quadratic equation,

$$ x = \frac{-b + \sqrt{b^2 - 4ac}}{2a} \tag{1.3.2} $$

the addition becomes delicate and roundoff-prone whenever $ac \ll b^2$. (In §5.6 we
will learn how to avoid the problem in this particular case.)

Roundoff error is a characteristic of computer hardware. There is another,
different, kind of error that is a characteristic of the program or algorithm used,
independent of the hardware on which the program is executed. Many numerical
algorithms compute “discrete” approximations to some desired “continuous” quantity.
For example, an integral is evaluated numerically by computing a function
at a discrete set of points, rather than at “every” point. Or, a function may be
evaluated by summing a finite number of leading terms in its infinite series, rather
than all infinity terms. In cases like this, there is an adjustable parameter, e.g., the
number of points or of terms, such that the “true” answer is obtained only when
that parameter goes to infinity. Any practical calculation is done with a finite, but
sufficiently large, choice of that parameter.

The discrepancy between the true answer and the answer obtained in a practical 
calculation is called the truncation error. Truncation error would persist even on a 
hypothetical, “perfect” computer that had an infinitely accurate representation and no 
roundoff error. As a general rule there is not much that a programmer can do about 
roundoff error, other than to choose algorithms that do not magnify it unnecessarily 
(see discussion of “stability” below). Truncation error, on the other hand, is entirely 
under the programmer’s control. In fact, it is only a slight exaggeration to say 
that clever minimization of truncation error is practically the entire content of the 
field of numerical analysis! 
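
A series summed to a finite number of terms gives perhaps the simplest picture of truncation error and of the programmer's control over it. The sketch below sums the Taylor series for the exponential; the names exp_series and trunc_error_at_one are invented for the illustration, and the reference value of e is typed in as a constant so that the example needs no library.

```c
/* Partial sum of exp(x) = sum over k >= 0 of x^k / k!, keeping only
   the first nterms terms. */
double exp_series(double x, int nterms)
{
    double term = 1.0, sum = 0.0;   /* term holds x^k / k! */
    int k;
    for (k = 0; k < nterms; k++) {
        sum += term;
        term *= x / (k + 1);
    }
    return sum;
}

/* Absolute truncation error of the partial sum at x = 1. */
double trunc_error_at_one(int nterms)
{
    const double e_ref = 2.718281828459045;   /* e to double precision */
    double d = exp_series(1.0, nterms) - e_ref;
    return d < 0.0 ? -d : d;
}
```

With 5 terms the error at x = 1 is about 0.01; with 10 terms it drops to a few times $10^{-7}$; by 20 terms the truncation error has fallen below the roundoff level of a double, and adding more terms changes nothing.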

Most of the time, truncation error and roundoff error do not strongly interact 
with one another. A calculation can be imagined as having, first, the truncation error 
that it would have if run on an infinite-precision computer, “plus” the roundoff error 
associated with the number of operations performed. 

Sometimes, however, an otherwise attractive method can be unstable. This 
means that any roundoff error that becomes “mixed into” the calculation at an early 
stage is successively magnified until it comes to swamp the true answer. An unstable 
method would be useful on a hypothetical, perfect computer; but in this imperfect 
world it is necessary for us to require that algorithms be stable — or if unstable 
that we use them with great caution. 

Here is a simple, if somewhat artificial, example of an unstable algorithm:
Suppose that it is desired to calculate all integer powers of the so-called “Golden
Mean,” the number given by

$$ \phi \equiv \frac{\sqrt{5} - 1}{2} \approx 0.61803398 \tag{1.3.3} $$

It turns out (you can easily verify) that the powers $\phi^n$ satisfy a simple recursion
relation,

$$ \phi^{n+1} = \phi^{n-1} - \phi^n \tag{1.3.4} $$

Thus, knowing the first two values $\phi^0 = 1$ and $\phi^1 = 0.61803398$, we can successively
apply (1.3.4) performing only a single subtraction, rather than a slower multiplication
by $\phi$, at each stage.

Unfortunately, the recurrence (1.3.4) also has another solution, namely the value
$-\frac{1}{2}(\sqrt{5} + 1)$. Since the recurrence is linear, and since this undesired solution has
magnitude greater than unity, any small admixture of it introduced by roundoff errors
will grow exponentially. On a typical machine with 32-bit wordlength, (1.3.4) starts
to give completely wrong answers by about n = 16, at which point $\phi^n$ is down to only
$10^{-4}$. The recurrence (1.3.4) is unstable, and cannot be used for the purpose stated.
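
You can watch this instability happen with a few lines of code. The sketch below computes the same power two ways in single precision: by the subtraction recurrence (1.3.4) and by plain repeated multiplication. The function names phi_power_recur and phi_power_mult are invented for the example, and exactly where the two begin to disagree may vary somewhat from machine to machine.

```c
/* phi^n from the two-term recurrence (1.3.4), in single precision. */
float phi_power_recur(int n)
{
    float prev = 1.0f;            /* phi^0 */
    float cur = 0.61803398f;      /* phi^1 */
    float next;
    int k;
    if (n == 0) return prev;
    for (k = 1; k < n; k++) {
        next = prev - cur;        /* phi^(k+1) = phi^(k-1) - phi^k */
        prev = cur;
        cur = next;
    }
    return cur;
}

/* The same power by repeated multiplication, which is stable. */
float phi_power_mult(int n)
{
    float p = 1.0f;
    int k;
    for (k = 0; k < n; k++)
        p *= 0.61803398f;
    return p;
}
```

For small n the two functions agree closely. By n = 40, where the true value is only a few times $10^{-9}$, the recurrence has been taken over by the growing solution and returns a number of order unity.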

We will encounter the question of stability in many more sophisticated guises, 
later in this book. 

CITED REFERENCES AND FURTHER READING: 

Stoer, J., and Bulirsch, R. 1980, Introduction to Numerical Analysis (New York: Springer-Verlag),
Chapter 1.

Kahaner, D., Moler, C., and Nash, S. 1989, Numerical Methods and Software (Englewood Cliffs,
NJ: Prentice-Hall), Chapter 2.

Johnson, L.W., and Riess, R.D. 1982, Numerical Analysis, 2nd ed. (Reading, MA: Addison-Wesley),
§1.3.

Wilkinson, J.H. 1964, Rounding Errors in Algebraic Processes (Englewood Cliffs, NJ: Prentice-Hall).


Chapter 2. Solution of Linear
Algebraic Equations


2.0 Introduction 

A set of linear algebraic equations looks like this: 


$$
\begin{aligned}
a_{11}x_1 + a_{12}x_2 + a_{13}x_3 + \cdots + a_{1N}x_N &= b_1 \\
a_{21}x_1 + a_{22}x_2 + a_{23}x_3 + \cdots + a_{2N}x_N &= b_2 \\
a_{31}x_1 + a_{32}x_2 + a_{33}x_3 + \cdots + a_{3N}x_N &= b_3 \\
\cdots \qquad\qquad & \\
a_{M1}x_1 + a_{M2}x_2 + a_{M3}x_3 + \cdots + a_{MN}x_N &= b_M
\end{aligned}
\tag{2.0.1}
$$

Here the N unknowns $x_j$, j = 1, 2, ..., N are related by M equations. The
coefficients $a_{ij}$ with i = 1, 2, ..., M and j = 1, 2, ..., N are known numbers, as
are the right-hand side quantities $b_i$, i = 1, 2, ..., M.

Nonsingular versus Singular Sets of Equations 

If N = M then there are as many equations as unknowns, and there is a good 
chance of solving for a unique solution set of $x_j$’s. Analytically, there can fail to
be a unique solution if one or more of the M equations is a linear combination of 
the others, a condition called row degeneracy, or if all equations contain certain 
variables only in exactly the same linear combination, called column degeneracy. 
(For square matrices, a row degeneracy implies a column degeneracy, and vice 
versa.) A set of equations that is degenerate is called singular. We will consider 
singular matrices in some detail in §2.6. 

Numerically, at least two additional things can go wrong: 

• While not exact linear combinations of each other, some of the equations
may be so close to linearly dependent that roundoff errors in the machine
render them linearly dependent at some stage in the solution process. In
this case your numerical procedure will fail, and it can tell you that it
has failed.

• Accumulated roundoff errors in the solution process can swamp the true
solution. This problem particularly emerges if N is too large. The
numerical procedure does not fail algorithmically. However, it returns a
set of x’s that are wrong, as can be discovered by direct substitution back
into the original equations. The closer a set of equations is to being singular,
the more likely this is to happen, since increasingly close cancellations
will occur during the solution. In fact, the preceding item can be viewed
as the special case where the loss of significance is unfortunately total.

Much of the sophistication of complicated “linear equation-solving packages” 
is devoted to the detection and/or correction of these two pathologies. As you 
work with large linear sets of equations, you will develop a feeling for when such 
sophistication is needed. It is difficult to give any firm guidelines, since there is no 
such thing as a “typical” linear problem. But here is a rough idea: Linear sets with 
N as large as 20 or 50 can be routinely solved in single precision (32 bit floating 
representations) without resorting to sophisticated methods, if the equations are not 
close to singular. With double precision (60 or 64 bits), this number can readily 
be extended to N as large as several hundred, after which point the limiting factor 
is generally machine time, not accuracy. 

Even larger linear sets, N in the thousands or greater, can be solved when the 
coefficients are sparse (that is, mostly zero), by methods that take advantage of the 
sparseness. We discuss this further in §2.7. 

At the other end of the spectrum, one seems just as often to encounter linear 
problems which, by their underlying nature, are close to singular. In this case, you 
might need to resort to sophisticated methods even for the case of N = 10 (though
rarely for N = 5). Singular value decomposition (§2.6) is a technique that can 
sometimes turn singular problems into nonsingular ones, in which case additional 
sophistication becomes unnecessary. 

Matrices 

Equation (2.0.1) can be written in matrix form as 

$$ \mathbf{A} \cdot \mathbf{x} = \mathbf{b} \tag{2.0.2} $$

Here the raised dot denotes matrix multiplication, A is the matrix of coefficients, and
b is the right-hand side written as a column vector,

$$
\mathbf{A} = \begin{bmatrix}
a_{11} & a_{12} & \cdots & a_{1N} \\
a_{21} & a_{22} & \cdots & a_{2N} \\
\vdots & \vdots & & \vdots \\
a_{M1} & a_{M2} & \cdots & a_{MN}
\end{bmatrix}
\qquad
\mathbf{b} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_M \end{bmatrix}
\tag{2.0.3}
$$

By convention, the first index on an element $a_{ij}$ denotes its row, the second
index its column. For most purposes you don’t need to know how a matrix is stored
in a computer’s physical memory; you simply reference matrix elements by their
two-dimensional addresses, e.g., $a_{34}$ = a[3][4]. We have already seen, in §1.2,
that this C notation can in fact hide a rather subtle and versatile physical storage
scheme, “pointer to array of pointers to rows.” You might wish to review that section
at this point. Occasionally it is useful to be able to peer through the veil, for example
to pass a whole row a[i][j], j=1,...,N by the reference a[i].

Tasks of Computational Linear Algebra 

We will consider the following tasks as falling in the general purview of this 
chapter: 


• Solution of the matrix equation A · x = b for an unknown vector x, where A
is a square matrix of coefficients, raised dot denotes matrix multiplication,
and b is a known right-hand side vector (§2.1-§2.10).

• Solution of more than one matrix equation A · $x_j$ = $b_j$, for a set of vectors
$x_j$, j = 1, 2, ..., each corresponding to a different, known right-hand side
vector $b_j$. In this task the key simplification is that the matrix A is held
constant, while the right-hand sides, the b’s, are changed (§2.1-§2.10).

• Calculation of the matrix $A^{-1}$ which is the matrix inverse of a square matrix
A, i.e., $A \cdot A^{-1} = A^{-1} \cdot A = 1$, where 1 is the identity matrix (all zeros
except for ones on the diagonal). This task is equivalent, for an N x N
matrix A, to the previous task with N different $b_j$’s (j = 1, 2, ..., N),
namely the unit vectors ($b_j$ = all zero elements except for 1 in the jth
component). The corresponding x’s are then the columns of the matrix
inverse of A (§2.1 and §2.3).

• Calculation of the determinant of a square matrix A (§2.3).

If M < N, or if M = N but the equations are degenerate, then there 
are effectively fewer equations than unknowns. In this case there can be either no 
solution, or else more than one solution vector x. In the latter event, the solution space 
consists of a particular solution $x_p$ added to any linear combination of (typically)
N - M vectors (which are said to be in the nullspace of the matrix A). The task
of finding the solution space of A involves 

• Singular value decomposition of a matrix A. 

This subject is treated in §2.6. 

In the opposite case there are more equations than unknowns, M > N. When 
this occurs there is, in general, no solution vector x to equation (2.0.1), and the set 
of equations is said to be overdetermined. It happens frequently, however, that the 
best “compromise” solution is sought, the one that comes closest to satisfying all 
equations simultaneously. If closeness is defined in the least-squares sense, i.e., that 
the sum of the squares of the differences between the left- and right-hand sides of 
equation (2.0.1) be minimized, then the overdetermined linear problem reduces to 
a (usually) solvable linear problem, called the 

• Linear least-squares problem. 

The reduced set of equations to be solved can be written as the N x N set of equations 

$$ (\mathbf{A}^T \cdot \mathbf{A}) \cdot \mathbf{x} = (\mathbf{A}^T \cdot \mathbf{b}) \tag{2.0.4} $$

where $A^T$ denotes the transpose of the matrix A. Equations (2.0.4) are called the
normal equations of the linear least-squares problem. There is a close connection 
between singular value decomposition and the linear least-squares problem, and the
latter is also discussed in §2.6. You should be warned that direct solution of the 
normal equations (2.0.4) is not generally the best way to find least-squares solutions. 
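
To see what the normal equations look like in the smallest useful case, consider fitting a straight line $y = x_0 + x_1 t$ to M data points. Row i of A is then (1, $t_i$), the right-hand side is $b_i = y_i$, and (2.0.4) is a 2 x 2 system. The sketch below forms and solves it directly; fit_line_normal is a name invented for this example. In keeping with the warning above, take it as a demonstration of the structure of (2.0.4), not as a recommended method for ill-conditioned problems.

```c
/* Fit y = x[0] + x[1]*t to m points by forming the 2 x 2 normal
   equations (A^T A) x = A^T b, where row i of A is (1, t[i]) and
   b[i] = y[i].  Returns 0 on success, -1 if the system is singular. */
int fit_line_normal(const double t[], const double y[], int m, double x[2])
{
    double s00 = 0.0, s01 = 0.0, s11 = 0.0;   /* entries of A^T A */
    double r0 = 0.0, r1 = 0.0;                /* entries of A^T b */
    double det;
    int i;
    for (i = 0; i < m; i++) {
        s00 += 1.0;
        s01 += t[i];
        s11 += t[i] * t[i];
        r0 += y[i];
        r1 += t[i] * y[i];
    }
    det = s00 * s11 - s01 * s01;
    if (det == 0.0) return -1;                /* e.g., all t[i] equal */
    x[0] = (r0 * s11 - s01 * r1) / det;       /* Cramer's rule */
    x[1] = (s00 * r1 - s01 * r0) / det;
    return 0;
}
```

For the three points (0,1), (1,3), (2,5), which lie exactly on a line, the routine returns intercept 1 and slope 2. For points that do not lie on a line, it returns the compromise that minimizes the sum of squared residuals.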

Some other topics in this chapter include 

• Iterative improvement of a solution (§2.5) 

• Various special forms: symmetric positive-definite (§2.9), tridiagonal 
(§2.4), band diagonal (§2.4), Toeplitz (§2.8), Vandermonde (§2.8), sparse 
(§2.7) 

• Strassen’s “fast matrix inversion” (§2.11). 

Standard Subroutine Packages 

We cannot hope, in this chapter or in this book, to tell you everything there is to 
know about the tasks that have been defined above. In many cases you will have no 
alternative but to use sophisticated black-box program packages. Several good ones 
are available, though not always in C. LINPACK was developed at Argonne National 
Laboratories and deserves particular mention because it is published, documented, 
and available for free use. A successor to LINPACK, LAPACK, is now becoming 
available. Packages available commercially (though not necessarily in C) include 
those in the IMSL and NAG libraries. 

You should keep in mind that the sophisticated packages are designed with very 
large linear systems in mind. They therefore go to great effort to minimize not only 
the number of operations, but also the required storage. Routines for the various 
tasks are usually provided in several versions, corresponding to several possible 
simplifications in the form of the input coefficient matrix: symmetric, triangular, 
banded, positive definite, etc. If you have a large matrix in one of these forms, 
you should certainly take advantage of the increased efficiency provided by these 
different routines, and not just use the form provided for general matrices. 

There is also a great watershed dividing routines that are direct (i.e., execute 
in a predictable number of operations) from routines that are iterative (i.e., attempt 
to converge to the desired answer in however many steps are necessary). Iterative 
methods become preferable when the battle against loss of significance is in danger 
of being lost, either due to large N or because the problem is close to singular. We 
will treat iterative methods only incompletely in this book, in §2.7 and in Chapters 
18 and 19. These methods are important, but mostly beyond our scope. We will, 
however, discuss in detail a technique which is on the borderline between direct 
and iterative methods, namely the iterative improvement of a solution that has been 
obtained by direct methods (§2.5). 


CITED REFERENCES AND FURTHER READING: 

Golub, G.H., and Van Loan, C.F. 1989, Matrix Computations, 2nd ed. (Baltimore: Johns Hopkins 
University Press). 

Gill, P.E., Murray, W., and Wright, M.H. 1991, Numerical Linear Algebra and Optimization, vol. 1 
(Redwood City, CA: Addison-Wesley). 

Stoer, J., and Bulirsch, R. 1980, Introduction to Numerical Analysis (New York: Springer-Verlag), 
Chapter 4. 

Dongarra, J.J., et al. 1979, LINPACK User’s Guide (Philadelphia: S.I.A.M.).

Coleman, T.F., and Van Loan, C. 1988, Handbook for Matrix Computations (Philadelphia: S.I.A.M.).

Forsythe, G.E., and Moler, C.B. 1967, Computer Solution of Linear Algebraic Systems (Englewood
Cliffs, NJ: Prentice-Hall).

Wilkinson, J.H., and Reinsch, C. 1971, Linear Algebra, vol. II of Handbook for Automatic
Computation (New York: Springer-Verlag).

Westlake, J.R. 1968, A Handbook of Numerical Matrix Inversion and Solution of Linear Equations 
(New York: Wiley). 

Johnson, L.W., and Riess, R.D. 1982, Numerical Analysis, 2nd ed. (Reading, MA: Addison- 
Wesley), Chapter 2. 

Ralston, A., and Rabinowitz, P. 1978, A First Course in Numerical Analysis, 2nd ed. (New York: 
McGraw-Hill), Chapter 9. 


2.1 Gauss-Jordan Elimination 

For inverting a matrix, Gauss-Jordan elimination is about as efficient as any 
other method. For solving sets of linear equations, Gauss-Jordan elimination 
produces both the solution of the equations for one or more right-hand side vectors 
b, and also the matrix inverse $A^{-1}$. However, its principal weaknesses are (i) that it
requires all the right-hand sides to be stored and manipulated at the same time, and 
(ii) that when the inverse matrix is not desired, Gauss-Jordan is three times slower 
than the best alternative technique for solving a single linear set (§2.3). The method’s 
principal strength is that it is as stable as any other direct method, perhaps even a 
bit more stable when full pivoting is used (see below). 

If you come along later with an additional right-hand side vector, you can 
multiply it by the inverse matrix, of course. This does give an answer, but one that is 
quite susceptible to roundoff error, not nearly as good as if the new vector had been 
included with the set of right-hand side vectors in the first instance. 

For these reasons, Gauss-Jordan elimination should usually not be your method 
of first choice, either for solving linear equations or for matrix inversion. The 
decomposition methods in §2.3 are better. Why do we give you Gauss-Jordan at all? 
Because it is straightforward, understandable, solid as a rock, and an exceptionally 
good “psychological” backup for those times that something is going wrong and you 
think it might be your linear-equation solver. 

Some people believe that the backup is more than psychological, that Gauss- 
Jordan elimination is an “independent” numerical method. This turns out to be 
mostly myth. Except for the relatively minor differences in pivoting, described 
below, the actual sequence of operations performed in Gauss-Jordan elimination is 
very closely related to that performed by the routines in the next two sections. 

For clarity, and to avoid writing endless ellipses (· · ·), we will write out equations
only for the case of four equations and four unknowns, and with three different
right-hand side vectors that are known in advance. You can write bigger matrices and
extend the equations to the case of N × N matrices, with M sets of right-hand
side vectors, in completely analogous fashion. The routine implemented below is,
of course, general.

Elimination on Column-Augmented Matrices

Consider the linear matrix equation 




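\[
\begin{pmatrix}
a_{11} & a_{12} & a_{13} & a_{14} \\
a_{21} & a_{22} & a_{23} & a_{24} \\
a_{31} & a_{32} & a_{33} & a_{34} \\
a_{41} & a_{42} & a_{43} & a_{44}
\end{pmatrix}
\cdot
\left[
\begin{pmatrix} x_{11} \\ x_{21} \\ x_{31} \\ x_{41} \end{pmatrix}
\sqcup
\begin{pmatrix} x_{12} \\ x_{22} \\ x_{32} \\ x_{42} \end{pmatrix}
\sqcup
\begin{pmatrix} x_{13} \\ x_{23} \\ x_{33} \\ x_{43} \end{pmatrix}
\sqcup
\begin{pmatrix}
y_{11} & y_{12} & y_{13} & y_{14} \\
y_{21} & y_{22} & y_{23} & y_{24} \\
y_{31} & y_{32} & y_{33} & y_{34} \\
y_{41} & y_{42} & y_{43} & y_{44}
\end{pmatrix}
\right]
=
\left[
\begin{pmatrix} b_{11} \\ b_{21} \\ b_{31} \\ b_{41} \end{pmatrix}
\sqcup
\begin{pmatrix} b_{12} \\ b_{22} \\ b_{32} \\ b_{42} \end{pmatrix}
\sqcup
\begin{pmatrix} b_{13} \\ b_{23} \\ b_{33} \\ b_{43} \end{pmatrix}
\sqcup
\begin{pmatrix}
1 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1
\end{pmatrix}
\right]
\tag{2.1.1}
\]

Here the raised dot (·) signifies matrix multiplication, while the operator ⊔ just
signifies column augmentation, that is, removing the abutting parentheses and
making a wider matrix out of the operands of the ⊔ operator.

It should not take you long to write out equation (2.1.1) and to see that it simply
states that $x_{ij}$ is the ith component (i = 1,2,3,4) of the vector solution of the jth
right-hand side (j = 1,2,3), the one whose coefficients are $b_{ij}$, i = 1,2,3,4; and
that the matrix of unknown coefficients $y_{ij}$ is the inverse matrix of $a_{ij}$. In other
words, the matrix solution of

\[
[\mathbf{A}] \cdot [\mathbf{x}_1 \sqcup \mathbf{x}_2 \sqcup \mathbf{x}_3 \sqcup \mathbf{Y}]
= [\mathbf{b}_1 \sqcup \mathbf{b}_2 \sqcup \mathbf{b}_3 \sqcup \mathbf{1}]
\tag{2.1.2}
\]

where A and Y are square matrices, the $\mathbf{b}_i$’s and $\mathbf{x}_i$’s are column vectors, and 1 is
the identity matrix, simultaneously solves the linear sets

\[
\mathbf{A} \cdot \mathbf{x}_1 = \mathbf{b}_1 \qquad
\mathbf{A} \cdot \mathbf{x}_2 = \mathbf{b}_2 \qquad
\mathbf{A} \cdot \mathbf{x}_3 = \mathbf{b}_3
\tag{2.1.3}
\]

and

\[
\mathbf{A} \cdot \mathbf{Y} = \mathbf{1}
\tag{2.1.4}
\]
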
Now it is also elementary to verify the following facts about (2.1.1): 

• Interchanging any two rows of A and the corresponding rows of the b’s 
and of 1, does not change (or scramble in any way) the solution x’s and 
Y. Rather, it just corresponds to writing the same set of linear equations 
in a different order. 

• Likewise, the solution set is unchanged and in no way scrambled if we 
replace any row in A by a linear combination of itself and any other row, 
as long as we do the same linear combination of the rows of the b’s and 1 
(which then is no longer the identity matrix, of course). 

• Interchanging any two columns of A gives the same solution set only 
if we simultaneously interchange corresponding rows of the x’s and of 
Y. In other words, this interchange scrambles the order of the rows in 
the solution. If we do this, we will need to unscramble the solution by 
restoring the rows to their original order. 

Gauss-Jordan elimination uses one or more of the above operations to reduce 
the matrix A to the identity matrix. When this is accomplished, the right-hand side 
becomes the solution set, as one sees instantly from (2.1.2).

Pivoting

In “Gauss-Jordan elimination with no pivoting,” only the second operation in
the above list is used. The first row is divided by the element $a_{11}$ (this being a
trivial linear combination of the first row with any other row, with zero coefficient for
the other row). Then the right amount of the first row is subtracted from each other
row to make all the remaining $a_{i1}$’s zero. The first column of A now agrees with
the identity matrix. We move to the second column and divide the second row by
$a_{22}$, then subtract the right amount of the second row from rows 1, 3, and 4, so as to
make their entries in the second column zero. The second column is now reduced
to the identity form. And so on for the third and fourth columns. As we do these
operations to A, we of course also do the corresponding operations to the b’s and to
1 (which by now no longer resembles the identity matrix in any way!).

Obviously we will run into trouble if we ever encounter a zero element on the 
(then current) diagonal when we are going to divide by the diagonal element. (The 
element that we divide by, incidentally, is called the pivot element or pivot.) Not so 
obvious, but true, is the fact that Gauss-Jordan elimination with no pivoting (no use of 
the first or third procedures in the above list) is numerically unstable in the presence 
of any roundoff error, even when a zero pivot is not encountered. You must never do 
Gauss-Jordan elimination (or Gaussian elimination, see below) without pivoting! 
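
The warning can be made concrete with a minimal sketch. It is not one of this book's routines: the function name gauss_jordan_demo, the fixed size NDIM, and the use of double precision are all assumptions made for illustration. The sketch runs Gauss-Jordan elimination on a two-equation system either with or without row interchanges, so that the effect of a tiny (but nonzero) pivot can be seen directly.

```c
#include <math.h>

#define NDIM 2   /* illustrative size only */

/* Hypothetical helper: Gauss-Jordan elimination on a[NDIM][NDIM] with a single
   right-hand side b. If use_pivoting is nonzero, rows are interchanged so that
   the largest element (in magnitude) at or below the diagonal in the current
   column becomes the pivot. On return, b holds the computed solution.
   Returns -1 if a zero pivot is met, 0 otherwise. */
int gauss_jordan_demo(double a[NDIM][NDIM], double b[NDIM], int use_pivoting)
{
    for (int col = 0; col < NDIM; col++) {
        int prow = col;
        if (use_pivoting) {
            for (int r = col + 1; r < NDIM; r++)
                if (fabs(a[r][col]) > fabs(a[prow][col])) prow = r;
        }
        if (prow != col) {                      /* interchange rows of A and b */
            for (int k = 0; k < NDIM; k++) {
                double t = a[col][k]; a[col][k] = a[prow][k]; a[prow][k] = t;
            }
            double t = b[col]; b[col] = b[prow]; b[prow] = t;
        }
        double piv = a[col][col];
        if (piv == 0.0) return -1;
        for (int k = 0; k < NDIM; k++) a[col][k] /= piv;   /* normalize pivot row */
        b[col] /= piv;
        for (int r = 0; r < NDIM; r++) {        /* clear the column in other rows */
            if (r == col) continue;
            double f = a[r][col];
            for (int k = 0; k < NDIM; k++) a[r][k] -= f * a[col][k];
            b[r] -= f * b[col];
        }
    }
    return 0;
}
```

As a usage example, take the system 1e-20·x1 + x2 = 1, x1 + x2 = 2, whose true solution is very nearly x1 = x2 = 1. In IEEE double arithmetic the call with use_pivoting = 0 divides by the pivot 1e-20, loses the second equation to roundoff, and returns x1 = 0; the call with use_pivoting = 1 interchanges the rows first and returns x1 = x2 = 1.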

So what is this magic pivoting? Nothing more than interchanging rows (partial
pivoting) or rows and columns (full pivoting), so as to put a particularly desirable
element in the diagonal position from which the pivot is about to be selected. Since 
we don’t want to mess up the part of the identity matrix that we have already built up, 
we can choose among elements that are both (i) on rows below (or on) the one that 
is about to be normalized, and also (ii) on columns to the right (or on) the column 
we are about to eliminate. Partial pivoting is easier than full pivoting, because we 
don’t have to keep track of the permutation of the solution vector. Partial pivoting 
makes available as pivots only the elements already in the correct column. It turns 
out that partial pivoting is “almost” as good as full pivoting, in a sense that can be 
made mathematically precise, but which need not concern us here (for discussion 
and references, see [1]). To show you both variants, we do full pivoting in the routine
in this section, partial pivoting in §2.3. 

We have to state how to recognize a particularly desirable pivot when we see 
one. The answer to this is not completely known theoretically. It is known, both 
theoretically and in practice, that simply picking the largest (in magnitude) available 
element as the pivot is a very good choice. A curiosity of this procedure, however, is 
that the choice of pivot will depend on the original scaling of the equations. If we take 
the third linear equation in our original set and multiply it by a factor of a million, it 
is almost guaranteed that it will contribute the first pivot; yet the underlying solution 
of the equations is not changed by this multiplication! One therefore sometimes sees 
routines which choose as pivot that element which would have been largest if the 
original equations had all been scaled to have their largest coefficient normalized to 
unity. This is called implicit pivoting. There is some extra bookkeeping to keep track 
of the scale factors by which the rows would have been multiplied. (The routines in 
§2.3 include implicit pivoting, but the routine in this section does not.) 
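
The scaling dependence described above can be sketched in a few lines. The helper names row_scales and pick_pivot_row, the row-major storage, and the 0-based indexing are assumptions of this sketch, not conventions of the routines in this book. The first function records one scale factor per row; the second chooses a pivot row in a given column, comparing either raw magnitudes or magnitudes weighted by the scale factors (implicit pivoting).

```c
#include <math.h>
#include <stddef.h>

/* Fill scale[i] with 1/max_j |a[i][j]| for a row-major n x n matrix a.
   Returns -1 if some row is entirely zero (the matrix is singular). */
int row_scales(size_t n, const double *a, double *scale)
{
    for (size_t i = 0; i < n; i++) {
        double big = 0.0;
        for (size_t j = 0; j < n; j++) {
            double t = fabs(a[i * n + j]);
            if (t > big) big = t;
        }
        if (big == 0.0) return -1;
        scale[i] = 1.0 / big;
    }
    return 0;
}

/* Among rows col..n-1, return the row whose entry in column col has the
   best figure of merit. With scale == NULL the figure of merit is the raw
   magnitude; otherwise it is the magnitude times that row's scale factor. */
size_t pick_pivot_row(size_t n, const double *a, size_t col, const double *scale)
{
    size_t best = col;
    double big = -1.0;
    for (size_t i = col; i < n; i++) {
        double merit = fabs(a[i * n + col]);
        if (scale != NULL) merit *= scale[i];
        if (merit > big) { big = merit; best = i; }
    }
    return best;
}
```

For example, with rows (2, 1, 1), (1, 3, 2), and (1, 5, 4) multiplied by a million, the unscaled search picks the third row as the first pivot, while the scaled search picks the first row, which is what it would have picked had the third equation never been multiplied.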

Finally, let us consider the storage requirements of the method. With a little 
reflection you will see that at every stage of the algorithm, either an element of A is
predictably a one or zero (if it is already in a part of the matrix that has been reduced
to identity form) or else the exactly corresponding element of the matrix that started 
as 1 is predictably a one or zero (if its mate in A has not been reduced to the identity 
form). Therefore the matrix 1 does not have to exist as separate storage: The matrix 
inverse of A is gradually built up in A as the original A is destroyed. Likewise, 
the solution vectors x can gradually replace the right-hand side vectors b and share 
the same storage, since after each column in A is reduced, the corresponding row 
entry in the b’s is never again used. 


Here is the routine for Gauss-Jordan elimination with full pivoting:

#include <math.h>
#include "nrutil.h"
#define SWAP(a,b) {temp=(a);(a)=(b);(b)=temp;}

void gaussj(float **a, int n, float **b, int m)
/* Linear equation solution by Gauss-Jordan elimination, equation (2.1.1) above. a[1..n][1..n]
   is the input matrix. b[1..n][1..m] is input containing the m right-hand side vectors. On
   output, a is replaced by its matrix inverse, and b is replaced by the corresponding set of
   solution vectors. */
{
    int *indxc,*indxr,*ipiv;        /* The integer arrays ipiv, indxr, and indxc are */
    int i,icol,irow,j,k,l,ll;       /* used for bookkeeping on the pivoting. */
    float big,dum,pivinv,temp;

    indxc=ivector(1,n);
    indxr=ivector(1,n);
    ipiv=ivector(1,n);
    for (j=1;j<=n;j++) ipiv[j]=0;
    for (i=1;i<=n;i++) {            /* This is the main loop over the columns to be reduced. */
        big=0.0;
        for (j=1;j<=n;j++)          /* This is the outer loop of the search for a pivot element. */
            if (ipiv[j] != 1)
                for (k=1;k<=n;k++) {
                    if (ipiv[k] == 0) {
                        if (fabs(a[j][k]) >= big) {
                            big=fabs(a[j][k]);
                            irow=j;
                            icol=k;
                        }
                    }
                }
        ++(ipiv[icol]);
        /* We now have the pivot element, so we interchange rows, if needed, to put the pivot
           element on the diagonal. The columns are not physically interchanged, only relabeled:
           indxc[i], the column of the ith pivot element, is the ith column that is reduced, while
           indxr[i] is the row in which that pivot element was originally located. If indxr[i] !=
           indxc[i] there is an implied column interchange. With this form of bookkeeping, the
           solution b's will end up in the correct order, and the inverse matrix will be scrambled
           by columns. */
        if (irow != icol) {
            for (l=1;l<=n;l++) SWAP(a[irow][l],a[icol][l])
            for (l=1;l<=m;l++) SWAP(b[irow][l],b[icol][l])
        }
        indxr[i]=irow;              /* We are now ready to divide the pivot row by the */
        indxc[i]=icol;              /* pivot element, located at irow and icol. */
        if (a[icol][icol] == 0.0) nrerror("gaussj: Singular Matrix");
        pivinv=1.0/a[icol][icol];
        a[icol][icol]=1.0;
        for (l=1;l<=n;l++) a[icol][l] *= pivinv;
        for (l=1;l<=m;l++) b[icol][l] *= pivinv;
        for (ll=1;ll<=n;ll++)       /* Next, we reduce the rows... */
            if (ll != icol) {       /* ...except for the pivot one, of course. */
                dum=a[ll][icol];
                a[ll][icol]=0.0;
                for (l=1;l<=n;l++) a[ll][l] -= a[icol][l]*dum;
                for (l=1;l<=m;l++) b[ll][l] -= b[icol][l]*dum;
            }
    }
    /* This is the end of the main loop over columns of the reduction. It only remains to
       unscramble the solution in view of the column interchanges. We do this by interchanging
       pairs of columns in the reverse order that the permutation was built up. */
    for (l=n;l>=1;l--) {
        if (indxr[l] != indxc[l])
            for (k=1;k<=n;k++)
                SWAP(a[k][indxr[l]],a[k][indxc[l]]);
    }                               /* And we are done. */
    free_ivector(ipiv,1,n);
    free_ivector(indxr,1,n);
    free_ivector(indxc,1,n);
}


Row versus Column Elimination Strategies 


The above discussion can be amplified by a modest amount of formalism. Row 
operations on a matrix A correspond to pre- (that is, left-) multiplication by some simple 
matrix R. For example, the matrix R with components 


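\[
R_{ij} =
\begin{cases}
1 & \text{if } i = j \text{ and } i \neq 2, 4 \\
1 & \text{if } i = 2,\ j = 4 \\
1 & \text{if } i = 4,\ j = 2 \\
0 & \text{otherwise}
\end{cases}
\tag{2.1.5}
\]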


effects the interchange of rows 2 and 4. Gauss-Jordan elimination by row operations alone 
(including the possibility of partial pivoting) consists of a series of such left-multiplications, 
yielding successively 


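\[
\begin{aligned}
\mathbf{A} \cdot \mathbf{x} &= \mathbf{b} \\
(\cdots \mathbf{R}_3 \cdot \mathbf{R}_2 \cdot \mathbf{R}_1 \cdot \mathbf{A}) \cdot \mathbf{x} &= \cdots \mathbf{R}_3 \cdot \mathbf{R}_2 \cdot \mathbf{R}_1 \cdot \mathbf{b} \\
(\mathbf{1}) \cdot \mathbf{x} &= \cdots \mathbf{R}_3 \cdot \mathbf{R}_2 \cdot \mathbf{R}_1 \cdot \mathbf{b} \\
\mathbf{x} &= \cdots \mathbf{R}_3 \cdot \mathbf{R}_2 \cdot \mathbf{R}_1 \cdot \mathbf{b}
\end{aligned}
\tag{2.1.6}
\]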


The key point is that since the R’s build from right to left, the right-hand side is simply 
transformed at each stage from one vector to another. 

Column operations, on the other hand, correspond to post-, or right-, multiplications 
by simple matrices, call them C. The matrix in equation (2.1.5), if right-multiplied onto a 
matrix A, will interchange A’s second and fourth columns. Elimination by column operations 
involves (conceptually) inserting a column operator, and also its inverse, between the matrix 
A and the unknown vector x: 

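\[
\begin{aligned}
\mathbf{A} \cdot \mathbf{x} &= \mathbf{b} \\
\mathbf{A} \cdot \mathbf{C}_1 \cdot \mathbf{C}_1^{-1} \cdot \mathbf{x} &= \mathbf{b} \\
\mathbf{A} \cdot \mathbf{C}_1 \cdot \mathbf{C}_2 \cdot \mathbf{C}_2^{-1} \cdot \mathbf{C}_1^{-1} \cdot \mathbf{x} &= \mathbf{b} \\
(\mathbf{A} \cdot \mathbf{C}_1 \cdot \mathbf{C}_2 \cdot \mathbf{C}_3 \cdots) \cdots \mathbf{C}_3^{-1} \cdot \mathbf{C}_2^{-1} \cdot \mathbf{C}_1^{-1} \cdot \mathbf{x} &= \mathbf{b} \\
(\mathbf{1}) \cdots \mathbf{C}_3^{-1} \cdot \mathbf{C}_2^{-1} \cdot \mathbf{C}_1^{-1} \cdot \mathbf{x} &= \mathbf{b}
\end{aligned}
\tag{2.1.7}
\]

which (peeling off the $\mathbf{C}^{-1}$’s one at a time) implies a solution

\[
\mathbf{x} = \mathbf{C}_1 \cdot \mathbf{C}_2 \cdot \mathbf{C}_3 \cdots \mathbf{b}
\tag{2.1.8}
\]

Notice the essential difference between equation (2.1.8) and equation (2.1.6). In the
former case, the C’s must be applied to b in the reverse order from that in which they become
known. That is, they must all be stored along the way. This requirement greatly reduces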
the usefulness of column operations, generally restricting them to simple permutations, for 
example in support of full pivoting. 
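
The correspondence between row operations and left-multiplication, and between column operations and right-multiplication, can be checked numerically. The following is a small sketch under stated assumptions: the names make_row_swap_2_4 and matmul4 are hypothetical, the size is fixed at 4 × 4, and array indices are 0-based, so "rows 2 and 4" of equation (2.1.5) are indices 1 and 3.

```c
#define DIM 4

/* Fill r with the matrix of equation (2.1.5): the identity matrix with
   rows 2 and 4 interchanged (0-based indices 1 and 3). */
void make_row_swap_2_4(double r[DIM][DIM])
{
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++)
            r[i][j] = (i == j && i != 1 && i != 3) ? 1.0 : 0.0;
    r[1][3] = 1.0;
    r[3][1] = 1.0;
}

/* c = p . q for DIM x DIM matrices; c must not alias p or q. */
void matmul4(double p[DIM][DIM], double q[DIM][DIM], double c[DIM][DIM])
{
    for (int i = 0; i < DIM; i++)
        for (int j = 0; j < DIM; j++) {
            double sum = 0.0;
            for (int k = 0; k < DIM; k++) sum += p[i][k] * q[k][j];
            c[i][j] = sum;
        }
}
```

Calling matmul4(r, a, ra) gives a matrix ra equal to a with its second and fourth rows interchanged, while matmul4(a, r, ar) gives ar equal to a with its second and fourth columns interchanged. In practice one would of course swap the rows or columns directly rather than form the product; the multiplication is only a way of seeing the formalism at work.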

CITED REFERENCES AND FURTHER READING: 

Wilkinson, J.H. 1965, The Algebraic Eigenvalue Problem (New York: Oxford University Press). [1] 
Carnahan, B., Luther, H.A., and Wilkes, J.O. 1969, Applied Numerical Methods (New York: 
Wiley), Example 5.2, p. 282. 

Bevington, P.R. 1969, Data Reduction and Error Analysis for the Physical Sciences (New York: 
McGraw-Hill), Program B-2, p. 298. 

Westlake, J.R. 1968, A Handbook of Numerical Matrix Inversion and Solution of Linear Equations 
(New York: Wiley). 

Ralston, A., and Rabinowitz, P. 1978, A First Course in Numerical Analysis, 2nd ed. (New York: 
McGraw-Hill), §9.3-1. 


2.2 Gaussian Elimination with Backsubstitution 

The usefulness of Gaussian elimination with backsubstitution is primarily 
pedagogical. It stands between full elimination schemes such as Gauss-Jordan, and 
triangular decomposition schemes such as will be discussed in the next section. 
Gaussian elimination reduces a matrix not all the way to the identity matrix, but 
only halfway, to a matrix whose components on the diagonal and above (say) remain 
nontrivial. Let us now see what advantages accrue. 

Suppose that in doing Gauss-Jordan elimination, as described in §2.1, we at
each stage subtract away rows only below the then-current pivot element. When $a_{22}$
is the pivot element, for example, we divide the second row by its value (as before),
but now use the pivot row to zero only $a_{32}$ and $a_{42}$, not $a_{12}$ (see equation 2.1.1).
Suppose, also, that we do only partial pivoting, never interchanging columns, so that
the order of the unknowns never needs to be modified.

Then, when we have done this for all the pivots, we will be left with a reduced 
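equation that looks like this (in the case of a single right-hand side vector):

\[
\begin{pmatrix}
a'_{11} & a'_{12} & a'_{13} & a'_{14} \\
0 & a'_{22} & a'_{23} & a'_{24} \\
0 & 0 & a'_{33} & a'_{34} \\
0 & 0 & 0 & a'_{44}
\end{pmatrix}
\cdot
\begin{pmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{pmatrix}
=
\begin{pmatrix} b'_1 \\ b'_2 \\ b'_3 \\ b'_4 \end{pmatrix}
\tag{2.2.1}
\]

Here the primes signify that the a’s and b’s do not have their original numerical
values, but have been modified by all the row operations in the elimination to this
point. The procedure up to this point is termed Gaussian elimination.

Backsubstitution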


But how do we solve for the x’s? The last x ($x_4$ in this example) is already
isolated, namely

\[
x_4 = b'_4 / a'_{44}
\tag{2.2.2}
\]

With the last x known we can move to the penultimate x,

\[
x_3 = \frac{1}{a'_{33}} \left[ b'_3 - x_4 a'_{34} \right]
\tag{2.2.3}
\]

and then proceed with the x before that one. The typical step is

\[
x_i = \frac{1}{a'_{ii}} \left[ b'_i - \sum_{j=i+1}^{N} a'_{ij} x_j \right]
\tag{2.2.4}
\]

The procedure defined by equation (2.2.4) is called backsubstitution. The
combination of Gaussian elimination and backsubstitution yields a solution to the set
of equations. 
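
The two stages can be sketched together as follows. This is an illustration only, under several assumptions that are not taken from this book's routines: the name gauss_solve, row-major storage in a flat array, 0-based indexing, and double precision. It also differs from the description above in one detail: the pivot row is not divided by the pivot during elimination, so the division appears in the backsubstitution step, exactly as in equation (2.2.4).

```c
#include <math.h>
#include <stddef.h>

/* Solve A.x = b by Gaussian elimination with partial pivoting followed by
   backsubstitution. a is a row-major n x n matrix and b the right-hand side;
   both are overwritten, and on success b holds the solution x.
   Returns -1 if a zero pivot is encountered, 0 otherwise. */
int gauss_solve(size_t n, double *a, double *b)
{
    for (size_t k = 0; k < n; k++) {
        /* Partial pivoting: largest element at or below the diagonal in column k. */
        size_t p = k;
        for (size_t i = k + 1; i < n; i++)
            if (fabs(a[i * n + k]) > fabs(a[p * n + k])) p = i;
        if (a[p * n + k] == 0.0) return -1;
        if (p != k) {
            for (size_t j = 0; j < n; j++) {
                double t = a[k * n + j]; a[k * n + j] = a[p * n + j]; a[p * n + j] = t;
            }
            double t = b[k]; b[k] = b[p]; b[p] = t;
        }
        /* Subtract away only the rows below the pivot. */
        for (size_t i = k + 1; i < n; i++) {
            double f = a[i * n + k] / a[k * n + k];
            a[i * n + k] = 0.0;
            for (size_t j = k + 1; j < n; j++) a[i * n + j] -= f * a[k * n + j];
            b[i] -= f * b[k];
        }
    }
    /* Backsubstitution, equation (2.2.4), from the last unknown upward. */
    for (size_t i = n; i-- > 0; ) {
        double sum = b[i];
        for (size_t j = i + 1; j < n; j++) sum -= a[i * n + j] * b[j];
        b[i] = sum / a[i * n + i];
    }
    return 0;
}
```

For instance, the system 2x + y - z = 8, -3x - y + 2z = -11, -2x + y + 2z = -3 is solved by x = 2, y = 3, z = -1, which this sketch reproduces to within roundoff.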

The advantage of Gaussian elimination and backsubstitution over Gauss-Jordan 
elimination is simply that the former is faster in raw operations count: The 
innermost loops of Gauss-Jordan elimination, each containing one subtraction and 
one multiplication, are executed $N^3$ and $N^2 M$ times (where there are N equations
and M right-hand side vectors). The corresponding loops in Gaussian elimination are executed
only $\frac{1}{3}N^3$ times (only half the matrix is reduced, and the increasing numbers of
predictable zeros reduce the count to one-third), and $\frac{1}{2}N^2 M$ times, respectively.
Each backsubstitution of a right-hand side is $\frac{1}{2}N^2$ executions of a similar loop (one
multiplication plus one subtraction). For $M \ll N$ (only a few right-hand sides)
Gaussian elimination thus has about a factor three advantage over Gauss-Jordan. 
(We could reduce this advantage to a factor 1.5 by not computing the inverse matrix 
as part of the Gauss-Jordan scheme.) 
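
The factor of three can be checked by brute force. The sketch below simply counts executions of the innermost loop on the matrix itself, ignoring the right-hand sides; the function names gauss_jordan_loops and gauss_elim_loops are hypothetical, and the loop structure is an assumption meant to mirror the two schemes as described in the text, not a transcription of any routine.

```c
/* Count executions of the innermost loop (one multiply plus one subtract)
   spent on the matrix itself, ignoring right-hand sides. */

/* Gauss-Jordan: every pivot updates all n columns of the other n-1 rows. */
unsigned long gauss_jordan_loops(unsigned long n)
{
    unsigned long count = 0;
    for (unsigned long piv = 0; piv < n; piv++)
        for (unsigned long row = 0; row < n; row++) {
            if (row == piv) continue;
            for (unsigned long col = 0; col < n; col++) count++;
        }
    return count;
}

/* Gaussian elimination: pivot k touches only the rows and columns beyond k. */
unsigned long gauss_elim_loops(unsigned long n)
{
    unsigned long count = 0;
    for (unsigned long k = 0; k < n; k++)
        for (unsigned long row = k + 1; row < n; row++)
            for (unsigned long col = k + 1; col < n; col++) count++;
    return count;
}
```

For n = 100 the two functions return 990000 and 328350 respectively, a ratio of about 3.0, in agreement with the leading-order estimates $N^3$ and $\frac{1}{3}N^3$.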

For computing the inverse matrix (which we can view as the case of M = N 
right-hand sides, namely the N unit vectors which are the columns of the identity 
matrix), Gaussian elimination and backsubstitution at first glance require $\frac{1}{3}N^3$ (matrix
reduction) $+ \frac{1}{2}N^3$ (right-hand side manipulations) $+ \frac{1}{2}N^3$ (N backsubstitutions)
$= \frac{4}{3}N^3$ loop executions, which is more than the $N^3$ for Gauss-Jordan. However, the
unit vectors are quite special in containing all zeros except for one element. If this
is taken into account, the right-side manipulations can be reduced to only $\frac{1}{6}N^3$ loop
executions, and, for matrix inversion, the two methods have identical efficiencies. 

Both Gaussian elimination and Gauss-Jordan elimination share the disadvantage 
that all right-hand sides must be known in advance. The LU decomposition method 
in the next section does not share that deficiency, and also has an equally small 
operations count, both for solution with any number of right-hand sides, and for 
matrix inversion. For this reason we will not implement the method of Gaussian 
elimination as a routine. 


CITED REFERENCES AND FURTHER READING: 

Ralston, A., and Rabinowitz, P. 1978, A First Course in Numerical Analysis, 2nd ed. (New York: 
McGraw-Hill), §9.3-1.

Isaacson, E., and Keller, H.B. 1966, Analysis of Numerical Methods (New York: Wiley), §2.1.

Johnson, L.W., and Riess, R.D. 1982, Numerical Analysis, 2nd ed. (Reading, MA:
Addison-Wesley), §2.2.1.

Westlake, J.R. 1968, A Handbook of Numerical Matrix Inversion and Solution of Linear Equations 
(New York: Wiley). 


2.3 LU Decomposition and Its Applications 

Suppose we are able to write the matrix A as a product of two matrices, 



L • U = A (2.3.1) 

where L is lower triangular (has elements only on the diagonal and below) and U 
is upper triangular (has elements only on the diagonal and above). For the case of 
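a 4 × 4 matrix A, for example, equation (2.3.1) would look like this:

\[
\begin{pmatrix}
\alpha_{11} & 0 & 0 & 0 \\
\alpha_{21} & \alpha_{22} & 0 & 0 \\
\alpha_{31} & \alpha_{32} & \alpha_{33} & 0 \\
\alpha_{41} & \alpha_{42} & \alpha_{43} & \alpha_{44}
\end{pmatrix}
\cdot
\begin{pmatrix}
\beta_{11} & \beta_{12} & \beta_{13} & \beta_{14} \\
0 & \beta_{22} & \beta_{23} & \beta_{24} \\
0 & 0 & \beta_{33} & \beta_{34} \\
0 & 0 & 0 & \beta_{44}
\end{pmatrix}
=
\begin{pmatrix}
a_{11} & a_{12} & a_{13} & a_{14} \\
a_{21} & a_{22} & a_{23} & a_{24} \\
a_{31} & a_{32} & a_{33} & a_{34} \\
a_{41} & a_{42} & a_{43} & a_{44}
\end{pmatrix}
\tag{2.3.2}
\]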

We can use a decomposition such as (2.3.1) to solve the linear set 

A • x = (L • U) • x = L • (U • x) = b (2.3.3) 

by first solving for the vector y such that 

L • y = b (2.3.4)

and then solving 

U • x = y (2.3.5) 

What is the advantage of breaking up one linear set into two successive ones? 
The advantage is that the solution of a triangular set of equations is quite trivial, as 
we have already seen in §2.2 (equation 2.2.4). Thus, equation (2.3.4) can be solved 
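by forward substitution as follows,

\[
y_1 = \frac{b_1}{\alpha_{11}}, \qquad
y_i = \frac{1}{\alpha_{ii}} \left[ b_i - \sum_{j=1}^{i-1} \alpha_{ij} y_j \right],
\quad i = 2, 3, \ldots, N
\tag{2.3.6}
\]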


while (2.3.5) can then be solved by backsubstitution exactly as in equations (2.2.2)-(2.2.4),

\[
x_N = \frac{y_N}{\beta_{NN}}, \qquad
x_i = \frac{1}{\beta_{ii}} \left[ y_i - \sum_{j=i+1}^{N} \beta_{ij} x_j \right],
\quad i = N-1, N-2, \ldots, 1
\tag{2.3.7}
\]

Equations (2.3.6) and (2.3.7) total (for each right-hand side b) $N^2$ executions
of an inner loop containing one multiply and one add. If we have N right-hand
sides which are the unit column vectors (which is the case when we are inverting a
matrix), then taking into account the leading zeros reduces the total execution count
of (2.3.6) from $\frac{1}{2}N^3$ to $\frac{1}{6}N^3$, while (2.3.7) is unchanged at $\frac{1}{2}N^3$.

Notice that, once we have the LU decomposition of A, we can solve with as 
many right-hand sides as we then care to, one at a time. This is a distinct advantage 
over the methods of §2.1 and §2.2. 
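
Equations (2.3.6) and (2.3.7) translate almost line for line into code. The following is a minimal sketch, assuming that L and U are already available as separate row-major arrays with nonzero diagonals and that no row permutation is involved; the names forward_subst and back_subst and the 0-based indexing are choices made for this illustration only.

```c
#include <stddef.h>

/* Solve L.y = b by forward substitution, equation (2.3.6).
   l is a row-major n x n lower triangular matrix with nonzero diagonal. */
void forward_subst(size_t n, const double *l, const double *b, double *y)
{
    for (size_t i = 0; i < n; i++) {
        double sum = b[i];
        for (size_t j = 0; j < i; j++) sum -= l[i * n + j] * y[j];
        y[i] = sum / l[i * n + i];
    }
}

/* Solve U.x = y by backsubstitution, equation (2.3.7).
   u is a row-major n x n upper triangular matrix with nonzero diagonal. */
void back_subst(size_t n, const double *u, const double *y, double *x)
{
    for (size_t i = n; i-- > 0; ) {
        double sum = y[i];
        for (size_t j = i + 1; j < n; j++) sum -= u[i * n + j] * x[j];
        x[i] = sum / u[i * n + i];
    }
}
```

As a small check, take L with rows (2, 0) and (1, 3), and U with rows (1, 4) and (0, 5), so that A = L · U has rows (2, 8) and (1, 19). For b = (18, 39), forward substitution gives y = (9, 10), and backsubstitution then gives x = (1, 2). A new right-hand side requires only a repeat of these two calls, never a new decomposition.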

Performing the LU Decomposition 

How then can we solve for L and U, given A? First, we write out the 
i, jth component of equation (2.3.1) or (2.3.2). That component always is a sum 
beginning with 


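\[
\alpha_{i1}\beta_{1j} + \cdots = a_{ij}
\]

The number of terms in the sum depends, however, on whether i or j is the smaller
number. We have, in fact, the three cases,

\[
i < j: \qquad \alpha_{i1}\beta_{1j} + \alpha_{i2}\beta_{2j} + \cdots + \alpha_{ii}\beta_{ij} = a_{ij}
\tag{2.3.8}
\]

\[
i = j: \qquad \alpha_{i1}\beta_{1j} + \alpha_{i2}\beta_{2j} + \cdots + \alpha_{ii}\beta_{jj} = a_{ij}
\tag{2.3.9}
\]

\[
i > j: \qquad \alpha_{i1}\beta_{1j} + \alpha_{i2}\beta_{2j} + \cdots + \alpha_{ij}\beta_{jj} = a_{ij}
\tag{2.3.10}
\]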


Equations (2.3.8)-(2.3.10) total $N^2$ equations for the $N^2 + N$ unknown α’s and
β’s (the diagonal being represented twice). Since the number of unknowns is greater
than the number of equations, we are invited to specify N of the unknowns arbitrarily
and then try to solve for the others. In fact, as we shall see, it is always possible to take

\[
\alpha_{ii} \equiv 1 \qquad i = 1, \ldots, N
\tag{2.3.11}
\]

A surprising procedure, now, is Crout’s algorithm, which quite trivially solves
the set of $N^2 + N$ equations (2.3.8)-(2.3.11) for all the α’s and β’s by just arranging
the equations in a certain order! That order is as follows:

• Set $\alpha_{ii} = 1$, i = 1,..., N (equation 2.3.11).

• For each j = 1,2,3,..., N do these two procedures: First, for i =
1,2,..., j, use (2.3.8), (2.3.9), and (2.3.11) to solve for $\beta_{ij}$, namely

\[
\beta_{ij} = a_{ij} - \sum_{k=1}^{i-1} \alpha_{ik}\beta_{kj}
\tag{2.3.12}
\]

(When i = 1 in 2.3.12 the summation term is taken to mean zero.) Second,
for i = j+1, j+2,..., N use (2.3.10) to solve for $\alpha_{ij}$, namely

\[
\alpha_{ij} = \frac{1}{\beta_{jj}} \left( a_{ij} - \sum_{k=1}^{j-1} \alpha_{ik}\beta_{kj} \right)
\tag{2.3.13}
\]



Be sure to do both procedures before going on to the next j.

Figure 2.3.1. Crout’s algorithm for LU decomposition of a matrix. Elements of the original matrix are
modified in the order indicated by lower case letters: a, b, c, etc. Shaded boxes show the previously 
modified elements that are used in modifying two typical elements, each indicated by an “x”. 

If you work through a few iterations of the above procedure, you will see that
the α’s and β’s that occur on the right-hand side of equations (2.3.12) and (2.3.13)
are already determined by the time they are needed. You will also see that every $a_{ij}$
is used only once and never again. This means that the corresponding $\alpha_{ij}$ or $\beta_{ij}$ can
be stored in the location that the a used to occupy: the decomposition is “in place.”
[The diagonal unity elements $\alpha_{ii}$ (equation 2.3.11) are not stored at all.] In brief,
Crout’s method fills in the combined matrix of α’s and β’s,

\[
\begin{pmatrix}
\beta_{11} & \beta_{12} & \beta_{13} & \beta_{14} \\
\alpha_{21} & \beta_{22} & \beta_{23} & \beta_{24} \\
\alpha_{31} & \alpha_{32} & \beta_{33} & \beta_{34} \\
\alpha_{41} & \alpha_{42} & \alpha_{43} & \beta_{44}
\end{pmatrix}
\tag{2.3.14}
\]



by columns from left to right, and within each column from top to bottom (see 
Figure 2.3.1). 
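
The column-by-column ordering can be sketched directly from equations (2.3.12) and (2.3.13). This is an illustration under stated assumptions, not a usable routine: the name crout_nopivot, the row-major flat storage, the 0-based indexing, and the double precision are all choices made here, and the sketch deliberately omits pivoting, so it should be tried only on matrices for which no zero or tiny diagonal element arises.

```c
#include <stddef.h>

/* In-place Crout decomposition WITHOUT pivoting (illustration only).
   a is a row-major n x n matrix. On return its upper triangle, including the
   diagonal, holds the beta's and its strict lower triangle holds the alpha's;
   the unit diagonal of the alpha's is implied and not stored.
   Returns -1 if a zero diagonal beta is encountered, 0 otherwise. */
int crout_nopivot(size_t n, double *a)
{
    for (size_t j = 0; j < n; j++) {
        /* Equation (2.3.12): beta_ij for the rows i at or above the diagonal. */
        for (size_t i = 0; i <= j; i++) {
            double sum = a[i * n + j];
            for (size_t k = 0; k < i; k++) sum -= a[i * n + k] * a[k * n + j];
            a[i * n + j] = sum;
        }
        if (a[j * n + j] == 0.0) return -1;
        /* Equation (2.3.13): alpha_ij for the rows i below the diagonal. */
        for (size_t i = j + 1; i < n; i++) {
            double sum = a[i * n + j];
            for (size_t k = 0; k < j; k++) sum -= a[i * n + k] * a[k * n + j];
            a[i * n + j] = sum / a[j * n + j];
        }
    }
    return 0;
}
```

For the matrix with rows (2, 1, 1), (4, 3, 3), (8, 7, 9), the array is overwritten with rows (2, 1, 1), (2, 1, 1), (4, 3, 2): the β’s are 2, 1, 1 in the first row, 1, 1 in the second, and 2 in the third, and the α’s below the diagonal are 2, 4, and 3. Multiplying the two triangular factors back together recovers the original matrix.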

What about pivoting? Pivoting (i.e., selection of a salubrious pivot element for 
the division in equation 2.3.13) is absolutely essential for the stability of Crout’s
method. Only partial pivoting (interchange of rows) can be implemented efficiently.
However this is enough to make the method stable. This means, incidentally, that 
we don’t actually decompose the matrix A into LU form, but rather we decompose 
a rowwise permutation of A. (If we keep track of what that permutation is, this 
decomposition is just as useful as the original one would have been.) 

Pivoting is slightly subtle in Crout’s algorithm. The key point to notice is that 
equation (2.3.12) in the case of i = j (its final application) is exactly the same as 
equation (2.3.13) except for the division in the latter equation; in both cases the 
upper limit of the sum is k = j — 1 (= i — 1). This means that we don’t have to 
commit ourselves as to whether the diagonal element $\beta_{jj}$ is the one that happens
to fall on the diagonal in the first instance, or whether one of the (undivided) $\alpha_{ij}$’s
below it in the column, i = j+1,..., N, is to be “promoted” to become the diagonal
β. This can be decided after all the candidates in the column are in hand. As you
should be able to guess by now, we will choose the largest one as the diagonal β
(pivot element), then do all the divisions by that element en masse. This is Crout’s 
method with partial pivoting. Our implementation has one additional wrinkle: It 
initially finds the largest element in each row, and subsequently (when it is looking 
for the maximal pivot element) scales the comparison as if we had initially scaled all 
the equations to make their maximum coefficient equal to unity; this is the implicit 
pivoting mentioned in §2.1.

#include <math.h>
#include "nrutil.h"
#define TINY 1.0e-20                /* A small number. */

void ludcmp(float **a, int n, int *indx, float *d)
/* Given a matrix a[1..n][1..n], this routine replaces it by the LU decomposition of a rowwise
   permutation of itself. a and n are input. a is output, arranged as in equation (2.3.14) above;
   indx[1..n] is an output vector that records the row permutation effected by the partial
   pivoting; d is output as ±1 depending on whether the number of row interchanges was even
   or odd, respectively. This routine is used in combination with lubksb to solve linear equations
   or invert a matrix. */
{
    int i,imax,j,k;
    float big,dum,sum,temp;
    float *vv;                      /* vv stores the implicit scaling of each row. */

    vv=vector(1,n);
    *d=1.0;                         /* No row interchanges yet. */
    for (i=1;i<=n;i++) {            /* Loop over rows to get the implicit scaling information. */
        big=0.0;
        for (j=1;j<=n;j++)
            if ((temp=fabs(a[i][j])) > big) big=temp;
        if (big == 0.0) nrerror("Singular matrix in routine ludcmp");
                                    /* No nonzero largest element. */
        vv[i]=1.0/big;              /* Save the scaling. */
    }
    for (j=1;j<=n;j++) {            /* This is the loop over columns of Crout's method. */
        for (i=1;i<j;i++) {         /* This is equation (2.3.12) except for i = j. */
            sum=a[i][j];
            for (k=1;k<i;k++) sum -= a[i][k]*a[k][j];
            a[i][j]=sum;
        }
        big=0.0;                    /* Initialize for the search for largest pivot element. */
        for (i=j;i<=n;i++) {        /* This is i = j of equation (2.3.12) and i = j+1...N */
            sum=a[i][j];            /* of equation (2.3.13). */
            for (k=1;k<j;k++)
                sum -= a[i][k]*a[k][j];
            a[i][j]=sum;
            if ( (dum=vv[i]*fabs(sum)) >= big) {
                /* Is the figure of merit for the pivot better than the best so far? */
                big=dum;
                imax=i;
            }
        }
        if (j != imax) {            /* Do we need to interchange rows? */
            for (k=1;k<=n;k++) {    /* Yes, do so... */
                dum=a[imax][k];
                a[imax][k]=a[j][k];
                a[j][k]=dum;
            }
            *d = -(*d);             /* ...and change the parity of d. */
            vv[imax]=vv[j];         /* Also interchange the scale factor. */
        }
        indx[j]=imax;
        if (a[j][j] == 0.0) a[j][j]=TINY;
        /* If the pivot element is zero the matrix is singular (at least to the precision of the
           algorithm). For some applications on singular matrices, it is desirable to substitute
           TINY for zero. */
        if (j != n) {               /* Now, finally, divide by the pivot element. */
            dum=1.0/(a[j][j]);
            for (i=j+1;i<=n;i++) a[i][j] *= dum;
        }
    }                               /* Go back for the next column in the reduction. */
    free_vector(vv,1,n);
}

Here is the routine for forward substitution and backsubstitution, implementing 
equations (2.3.6) and (2.3.7). 

void lubksb(float **a, int n, int *indx, float b[])
/* Solves the set of n linear equations A·X = B. Here a[1..n][1..n] is input, not as the matrix
   A but rather as its LU decomposition, determined by the routine ludcmp. indx[1..n] is input
   as the permutation vector returned by ludcmp. b[1..n] is input as the right-hand side vector
   B, and returns with the solution vector X. a, n, and indx are not modified by this routine
   and can be left in place for successive calls with different right-hand sides b. This routine takes
   into account the possibility that b will begin with many zero elements, so it is efficient for use
   in matrix inversion. */
{
    int i,ii=0,ip,j;
    float sum;

    /* When ii is set to a positive value, it will become the index of the first nonvanishing
       element of b. We now do the forward substitution, equation (2.3.6). The only new
       wrinkle is to unscramble the permutation as we go. */
    for (i=1;i<=n;i++) {
        ip=indx[i];
        sum=b[ip];
        b[ip]=b[i];
        if (ii)
            for (j=ii;j<=i-1;j++) sum -= a[i][j]*b[j];
        else if (sum) ii=i;         /* A nonzero element was encountered, so from now on we */
        b[i]=sum;                   /* will have to do the sums in the loop above. */
    }
    for (i=n;i>=1;i--) {            /* Now we do the backsubstitution, equation (2.3.7). */
        sum=b[i];
        for (j=i+1;j<=n;j++) sum -= a[i][j]*b[j];
        b[i]=sum/a[i][i];           /* Store a component of the solution vector X. */
    }                               /* All done! */
}

The LU decomposition in ludcmp requires about $\frac{1}{3}N^3$ executions of the inner
loops (each with one multiply and one add). This is thus the operation count
for solving one (or a few) right-hand sides, and is a factor of 3 better than the
Gauss-Jordan routine gaussj which was given in §2.1, and a factor of 1.5 better
than a Gauss-Jordan routine (not given) that does not compute the inverse matrix.
For inverting a matrix, the total count (including the forward and backsubstitution
as discussed following equation 2.3.7 above) is $(\frac{1}{3} + \frac{1}{6} + \frac{1}{2})N^3 = N^3$, the same
as gaussj.

To summarize, this is the preferred way to solve the linear set of equations 
A • x = b: 

float **a,*b,d;
int n,*indx;

ludcmp(a,n,indx,&d);
lubksb(a,n,indx,b);

The answer x will be given back in b. Your original matrix A will have 
been destroyed. 

If you subsequently want to solve a set of equations with the same A but a 
different right-hand side b, you repeat only 

lubksb(a,n,indx,b); 

not, of course, with the original matrix A, but with a and indx as were already 
set by ludcmp. 

Inverse of a Matrix 

Using the above LU decomposition and backsubstitution routines, it is
completely straightforward to find the inverse of a matrix column by column.


#define N ...
float **a,**y,d,*col;
int i,j,*indx;

ludcmp(a,N,indx,&d);                /* Decompose the matrix just once. */
for(j=1;j<=N;j++) {                 /* Find inverse by columns. */
    for(i=1;i<=N;i++) col[i]=0.0;
    col[j]=1.0;
    lubksb(a,N,indx,col);
    for(i=1;i<=N;i++) y[i][j]=col[i];
}

The matrix y will now contain the inverse of the original matrix a, which will have 
been destroyed. Alternatively, there is nothing wrong with using a Gauss-Jordan 
routine like gaussj (§2.1) to invert a matrix in place, again destroying the original. 
Both methods have practically the same operations count. 
Incidentally, if you ever have the need to compute $\mathbf{A}^{-1}\cdot\mathbf{B}$ from matrices A
and B, you should LU decompose A and then backsubstitute with the columns of
B instead of with the unit vectors that would give A’s inverse. This saves a whole
matrix multiplication, and is also more accurate.
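
As a rough illustration of this advice, the following self-contained sketch forms $\mathbf{A}^{-1}\cdot\mathbf{B}$ by decomposing A once and backsubstituting each column of B. It is not the book's ludcmp/lubksb pair: the names lu_decompose, lu_solve, and solve_matrix are hypothetical, arrays are zero-based and row-major, and double precision is assumed throughout.

```c
#include <math.h>

/* LU-decompose the n x n row-major matrix a in place, with partial pivoting.
   perm[k] records which row was swapped into position k at step k.
   Returns 0 on success, -1 if an exactly zero pivot is met. */
int lu_decompose(double *a, int n, int *perm)
{
    for (int k = 0; k < n; k++) {
        int p = k;
        for (int i = k + 1; i < n; i++)
            if (fabs(a[i*n + k]) > fabs(a[p*n + k])) p = i;
        if (a[p*n + k] == 0.0) return -1;
        perm[k] = p;
        if (p != k)
            for (int j = 0; j < n; j++) {      /* swap whole rows k and p */
                double t = a[k*n + j];
                a[k*n + j] = a[p*n + j];
                a[p*n + j] = t;
            }
        for (int i = k + 1; i < n; i++) {
            a[i*n + k] /= a[k*n + k];          /* multiplier, stored in L */
            for (int j = k + 1; j < n; j++)
                a[i*n + j] -= a[i*n + k] * a[k*n + j];
        }
    }
    return 0;
}

/* Solve A x = b using the output of lu_decompose; b is overwritten by x. */
void lu_solve(const double *lu, int n, const int *perm, double *b)
{
    for (int k = 0; k < n; k++) {              /* apply the row interchanges */
        double t = b[k];
        b[k] = b[perm[k]];
        b[perm[k]] = t;
    }
    for (int i = 1; i < n; i++)                /* forward substitution */
        for (int j = 0; j < i; j++) b[i] -= lu[i*n + j] * b[j];
    for (int i = n - 1; i >= 0; i--) {         /* backsubstitution */
        for (int j = i + 1; j < n; j++) b[i] -= lu[i*n + j] * b[j];
        b[i] /= lu[i*n + i];
    }
}

/* Overwrite the n x m row-major matrix b with A^{-1} B, column by column.
   a is destroyed; perm and col are caller-supplied workspace of length n. */
int solve_matrix(double *a, int n, double *b, int m, int *perm, double *col)
{
    if (lu_decompose(a, n, perm) != 0) return -1;
    for (int j = 0; j < m; j++) {
        for (int i = 0; i < n; i++) col[i] = b[i*m + j];
        lu_solve(a, n, perm, col);
        for (int i = 0; i < n; i++) b[i*m + j] = col[i];
    }
    return 0;
}
```

The inverse itself is never formed; each column of B costs only one forward and one backsubstitution.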


Determinant of a Matrix 


The determinant of an LU decomposed matrix is just the product of the 
diagonal elements, 


$$\det = \prod_{j=1}^{N} \beta_{jj} \qquad (2.3.15)$$

We don’t, recall, compute the decomposition of the original matrix, but rather a 
decomposition of a rowwise permutation of it. Luckily, we have kept track of 
whether the number of row interchanges was even or odd, so we just preface the 
product by the corresponding sign. (You now finally know the purpose of setting 
d in the routine ludcmp.) 

Calculation of a determinant thus requires one call to ludcmp, with no
subsequent backsubstitutions by lubksb.

#define N ...
float **a,d;
int j,*indx;

ludcmp(a,N,indx,&d);                /* This returns d as ±1. */
for(j=1;j<=N;j++) d *= a[j][j];

The variable d now contains the determinant of the original matrix a, which will 
have been destroyed. 

For a matrix of any substantial size, it is quite likely that the determinant will 
overflow or underflow your computer’s floating-point dynamic range. In this case 
you can modify the loop of the above fragment and (e.g.) divide by powers of ten, 
to keep track of the scale separately, or (e.g.) accumulate the sum of logarithms of 
the absolute values of the factors and the sign separately. 

Complex Systems of Equations 

If your matrix A is real, but the right-hand side vector is complex, say b + id, then (i) 
LU decompose A in the usual way, (ii) backsubstitute b to get the real part of the solution 
vector, and (iii) backsubstitute d to get the imaginary part of the solution vector. 

If the matrix itself is complex, so that you want to solve the system 

$$(\mathbf{A} + i\mathbf{C}) \cdot (\mathbf{x} + i\mathbf{y}) = (\mathbf{b} + i\mathbf{d}) \qquad (2.3.16)$$

then there are two possible ways to proceed. The best way is to rewrite ludcmp and lubksb 
as complex routines. Complex modulus substitutes for absolute value in the construction of 
the scaling vector vv and in the search for the largest pivot elements. Everything else goes 
through in the obvious way, with complex arithmetic used as needed. (See §§1.2 and 5.4 for 
discussion of complex arithmetic in C.) 
A quick-and-dirty way to solve complex systems is to take the real and imaginary
parts of (2.3.16), giving

$$\begin{aligned}
\mathbf{A}\cdot\mathbf{x} - \mathbf{C}\cdot\mathbf{y} &= \mathbf{b} \\
\mathbf{C}\cdot\mathbf{x} + \mathbf{A}\cdot\mathbf{y} &= \mathbf{d}
\end{aligned} \qquad (2.3.17)$$

which can be written as a $2N \times 2N$ set of real equations,

$$\begin{pmatrix} \mathbf{A} & -\mathbf{C} \\ \mathbf{C} & \mathbf{A} \end{pmatrix} \cdot \begin{pmatrix} \mathbf{x} \\ \mathbf{y} \end{pmatrix} = \begin{pmatrix} \mathbf{b} \\ \mathbf{d} \end{pmatrix} \qquad (2.3.18)$$

and then solved with ludcmp and lubksb in their present forms. This scheme is a factor of
2 inefficient in storage, since A and C are stored twice. It is also a factor of 2 inefficient
in time, since the complex multiplies in a complexified version of the routines would each
use 4 real multiplies, while the solution of a $2N \times 2N$ problem involves 8 times the work of
an $N \times N$ one. If you can tolerate these factor-of-two inefficiencies, then equation (2.3.18)
is an easy way to proceed.
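
To make the block layout of the real $2N \times 2N$ system concrete, here is a small sketch that assembles the matrix and right-hand side from the real and imaginary parts. The function name build_real_system and the zero-based, row-major, double-precision storage are assumptions for illustration only.

```c
/* Assemble the real 2n x 2n system equivalent to (A + iC)(x + iy) = b + id.
   a and c are n x n (row-major); b and d have length n.
   On output m is 2n x 2n (row-major) and rhs has length 2n:
       m = [ A  -C ]      rhs = [ b ]
           [ C   A ]            [ d ]                                   */
void build_real_system(int n, const double *a, const double *c,
                       const double *b, const double *d,
                       double *m, double *rhs)
{
    int n2 = 2 * n;
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            m[i*n2 + j]           =  a[i*n + j];   /* upper left:   A */
            m[i*n2 + n + j]       = -c[i*n + j];   /* upper right: -C */
            m[(n + i)*n2 + j]     =  c[i*n + j];   /* lower left:   C */
            m[(n + i)*n2 + n + j] =  a[i*n + j];   /* lower right:  A */
        }
        rhs[i]     = b[i];
        rhs[n + i] = d[i];
    }
}
```

After solving the real system, the first n components of the solution are the real part x and the last n are the imaginary part y.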


CITED REFERENCES AND FURTHER READING: 

Golub, G.H., and Van Loan, C.F. 1989, Matrix Computations, 2nd ed. (Baltimore: Johns Hopkins 
University Press), Chapter 4. 

Dongarra, J.J., et al. 1979, LINPACK User’s Guide (Philadelphia: S.I.A.M.).

Forsythe, G.E., Malcolm, M.A., and Moler, C.B. 1977, Computer Methods for Mathematical 
Computations (Englewood Cliffs, NJ: Prentice-Hall), §3.3, and p. 50. 

Forsythe, G.E., and Moler, C.B. 1967, Computer Solution of Linear Algebraic Systems
(Englewood Cliffs, NJ: Prentice-Hall), Chapters 9, 16, and 18.

Westlake, J.R. 1968, A Handbook of Numerical Matrix Inversion and Solution of Linear Equations 
(New York: Wiley). 

Stoer, J., and Bulirsch, R. 1980, Introduction to Numerical Analysis (New York: Springer-Verlag), 
§4.2. 

Ralston, A., and Rabinowitz, P. 1978, A First Course in Numerical Analysis, 2nd ed. (New York: 
McGraw-Hill), §9.11. 

Horn, R.A., and Johnson, C.R. 1985, Matrix Analysis (Cambridge: Cambridge University Press). 


2.4 Tridiagonal and Band Diagonal Systems 
of Equations 

The special case of a system of linear equations that is tridiagonal, that is, has 
nonzero elements only on the diagonal plus or minus one column, is one that occurs 
frequently. Also common are systems that are band diagonal, with nonzero elements 
only along a few diagonal lines adjacent to the main diagonal (above and below). 

For tridiagonal sets, the procedures of LU decomposition, forward- and
backsubstitution each take only O(N) operations, and the whole solution can be encoded
very concisely. The resulting routine tridag is one that we will use in later chapters.

Naturally, one does not reserve storage for the full N x N matrix, but only for 
the nonzero components, stored as three vectors. The set of equations to be solved is 


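$$
\begin{bmatrix}
b_1 & c_1 & 0 & \cdots & & & \\
a_2 & b_2 & c_2 & \cdots & & & \\
 & & & \cdots & & & \\
 & & & \cdots & a_{N-1} & b_{N-1} & c_{N-1} \\
 & & & \cdots & 0 & a_N & b_N
\end{bmatrix}
\cdot
\begin{bmatrix} u_1 \\ u_2 \\ \cdots \\ u_{N-1} \\ u_N \end{bmatrix}
=
\begin{bmatrix} r_1 \\ r_2 \\ \cdots \\ r_{N-1} \\ r_N \end{bmatrix}
\qquad (2.4.1)
$$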
Notice that $a_1$ and $c_N$ are undefined and are not referenced by the routine that follows.

#include "nrutil.h"

void tridag(float a[], float b[], float c[], float r[], float u[],
    unsigned long n)
/* Solves for a vector u[1..n] the tridiagonal linear set given by equation (2.4.1). a[1..n],
b[1..n], c[1..n], and r[1..n] are input vectors and are not modified. */
{
    unsigned long j;
    float bet,*gam;

    gam=vector(1,n);                /* One vector of workspace, gam, is needed. */
    if (b[1] == 0.0) nrerror("Error 1 in tridag");
    /* If this happens then you should rewrite your equations as a set of order N-1,
       with u2 trivially eliminated. */
    u[1]=r[1]/(bet=b[1]);
    for (j=2;j<=n;j++) {            /* Decomposition and forward substitution. */
        gam[j]=c[j-1]/bet;
        bet=b[j]-a[j]*gam[j];
        if (bet == 0.0) nrerror("Error 2 in tridag");   /* Algorithm fails; see below. */
        u[j]=(r[j]-a[j]*u[j-1])/bet;
    }
    for (j=(n-1);j>=1;j--)
        u[j] -= gam[j+1]*u[j+1];    /* Backsubstitution. */
    free_vector(gam,1,n);
}

There is no pivoting in tridag. It is for this reason that tridag can fail even 
when the underlying matrix is nonsingular: A zero pivot can be encountered even for 
a nonsingular matrix. In practice, this is not something to lose sleep about. The kinds 
of problems that lead to tridiagonal linear sets usually have additional properties 
which guarantee that the algorithm in tridag will succeed. For example, if 

$$|b_j| > |a_j| + |c_j| \qquad j = 1, \ldots, N \qquad (2.4.2)$$

(called diagonal dominance ) then it can be shown that the algorithm cannot encounter 
a zero pivot. 
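
A check of condition (2.4.2) before calling a tridiagonal solver might look like the sketch below. The name is_diagonally_dominant and the zero-based indexing are assumptions; in this convention a[0] and c[n-1] lie outside the matrix and are not read, mirroring the undefined $a_1$ and $c_N$ of equation (2.4.1).

```c
#include <math.h>

/* Returns 1 if |b_j| > |a_j| + |c_j| holds for every row of the tridiagonal
   matrix with subdiagonal a, diagonal b, superdiagonal c (zero-based,
   length n each), and 0 otherwise. a[0] and c[n-1] are never referenced. */
int is_diagonally_dominant(const double *a, const double *b,
                           const double *c, int n)
{
    for (int j = 0; j < n; j++) {
        double off = 0.0;
        if (j > 0)     off += fabs(a[j]);
        if (j < n - 1) off += fabs(c[j]);
        if (!(fabs(b[j]) > off)) return 0;   /* strict inequality required */
    }
    return 1;
}
```

A return value of 0 does not mean the solver will fail, only that this particular sufficient condition is not met.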

It is possible to construct special examples in which the lack of pivoting in the 
algorithm causes numerical instability. In practice, however, such instability is almost 
never encountered — unlike the general matrix problem where pivoting is essential. 

The tridiagonal algorithm is the rare case of an algorithm that, in practice, is 
more robust than theory says it should be. Of course, should you ever encounter a 
problem for which tridag fails, you can instead use the more general method for 
band diagonal systems, now described (routines bandec and banbks). 

Some other matrix forms consisting of tridiagonal with a small number of
additional elements (e.g., upper right and lower left corners) also allow rapid
solution; see §2.7.

Band Diagonal Systems 

Where tridiagonal systems have nonzero elements only on the diagonal plus or minus
one, band diagonal systems are slightly more general and have (say) $m_1 \ge 0$ nonzero elements
immediately to the left of (below) the diagonal and $m_2 \ge 0$ nonzero elements immediately to
its right (above it). Of course, this is only a useful classification if $m_1$ and $m_2$ are both $\ll N$.
In that case, the solution of the linear system by LU decomposition can be accomplished
much faster, and in much less storage, than for the general $N \times N$ case.

The precise definition of a band diagonal matrix with elements $a_{ij}$ is that

$$a_{ij} = 0 \quad\text{when}\quad j > i + m_2 \quad\text{or}\quad i > j + m_1 \qquad (2.4.3)$$

Band diagonal matrices are stored and manipulated in a so-called compact form, which results
if the matrix is tilted 45° clockwise, so that its nonzero elements lie in a long, narrow
matrix with $m_1 + 1 + m_2$ columns and N rows. This is best illustrated by an example:
The band diagonal matrix


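$$
\begin{pmatrix}
3 & 1 & 0 & 0 & 0 & 0 & 0 \\
4 & 1 & 5 & 0 & 0 & 0 & 0 \\
9 & 2 & 6 & 5 & 0 & 0 & 0 \\
0 & 3 & 5 & 8 & 9 & 0 & 0 \\
0 & 0 & 7 & 9 & 3 & 2 & 0 \\
0 & 0 & 0 & 3 & 8 & 4 & 6 \\
0 & 0 & 0 & 0 & 2 & 4 & 4
\end{pmatrix}
\qquad (2.4.4)
$$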


which has N = 7, $m_1 = 2$, and $m_2 = 1$, is stored compactly as the $7 \times 4$ matrix,


$$
\begin{pmatrix}
x & x & 3 & 1 \\
x & 4 & 1 & 5 \\
9 & 2 & 6 & 5 \\
3 & 5 & 8 & 9 \\
7 & 9 & 3 & 2 \\
3 & 8 & 4 & 6 \\
2 & 4 & 4 & x
\end{pmatrix}
\qquad (2.4.5)
$$


Here x denotes elements that are wasted space in the compact format; these will not be
referenced by any manipulations and can have arbitrary values. Notice that the diagonal
of the original matrix appears in column $m_1 + 1$, with subdiagonal elements to its left,
superdiagonal elements to its right.
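
The index mapping behind the compact form can be sketched as follows. The helper names band_to_compact and band_get are hypothetical, indices are zero-based, and storage is row-major, so element (i, j) of the full matrix lives in compact column j - i + m1. The wasted slots are filled with zero here purely for definiteness.

```c
/* Pack an n x n band diagonal matrix (full, row-major) with m1 subdiagonals
   and m2 superdiagonals into compact form: n rows by m1+m2+1 columns.
   Compact column k of row i holds full element (i, i + k - m1). */
void band_to_compact(int n, int m1, int m2, const double *full,
                     double *compact)
{
    int w = m1 + m2 + 1;
    for (int i = 0; i < n; i++)
        for (int k = 0; k < w; k++) {
            int j = i + k - m1;              /* column in the full matrix */
            compact[i*w + k] = (j >= 0 && j < n) ? full[i*n + j] : 0.0;
        }
}

/* Read element (i, j) of the original matrix back out of compact storage.
   Elements outside the band are zero by definition (2.4.3). */
double band_get(int m1, int m2, const double *compact, int i, int j)
{
    int k = j - i + m1;
    if (k < 0 || k > m1 + m2) return 0.0;
    return compact[i*(m1 + m2 + 1) + k];
}
```

Applied to the 7 × 7 example (2.4.4) with m1 = 2 and m2 = 1, row index 2 of the compact array comes out as 9 2 6 5, in agreement with (2.4.5).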

The simplest manipulation of a band diagonal matrix, stored compactly, is to multiply
it by a vector to its right. Although this is algorithmically trivial, you might want to study
the following routine carefully, as an example of how to pull nonzero elements $a_{ij}$ out of the
compact storage format in an orderly fashion.


#include "nrutil.h"

void banmul(float **a, unsigned long n, int m1, int m2, float x[], float b[])
/* Matrix multiply b = A · x, where A is band diagonal with m1 rows below the diagonal and m2
rows above. The input vector x and output vector b are stored as x[1..n] and b[1..n],
respectively. The array a[1..n][1..m1+m2+1] stores A as follows: The diagonal elements
are in a[1..n][m1+1]. Subdiagonal elements are in a[j..n][1..m1] (with j > 1 appropriate
to the number of elements on each subdiagonal). Superdiagonal elements are in
a[1..j][m1+2..m1+m2+1] with j < n appropriate to the number of elements on each
superdiagonal. */
{
    unsigned long i,j,k,tmploop;

    for (i=1;i<=n;i++) {
        k=i-m1-1;
        tmploop=LMIN(m1+m2+1,n-k);
        b[i]=0.0;
        for (j=LMAX(1,1-k);j<=tmploop;j++) b[i] += a[i][j]*x[j+k];
    }
}
It is not possible to store the LU decomposition of a band diagonal matrix A quite
as compactly as the compact form of A itself. The decomposition (essentially by Crout’s
method, see §2.3) produces additional nonzero “fill-ins.” One straightforward storage scheme
is to return the upper triangular factor (U) in the same space that A previously occupied, and
to return the lower triangular factor (L) in a separate compact matrix of size $N \times m_1$. The
diagonal elements of U (whose product, times d = ±1, gives the determinant) are returned
in the first column of A’s storage space.

The following routine, bandec, is the band-diagonal analog of ludcmp in §2.3: 

#include <math.h>
#define SWAP(a,b) {dum=(a);(a)=(b);(b)=dum;}
#define TINY 1.0e-20

void bandec(float **a, unsigned long n, int m1, int m2, float **al,
    unsigned long indx[], float *d)
/* Given an n x n band diagonal matrix A with m1 subdiagonal rows and m2 superdiagonal rows,
compactly stored in the array a[1..n][1..m1+m2+1] as described in the comment for routine
banmul, this routine constructs an LU decomposition of a rowwise permutation of A. The upper
triangular matrix replaces a, while the lower triangular matrix is returned in al[1..n][1..m1].
indx[1..n] is an output vector which records the row permutation effected by the partial
pivoting; d is output as ±1 depending on whether the number of row interchanges was even
or odd, respectively. This routine is used in combination with banbks to solve band-diagonal
sets of equations. */
{
    unsigned long i,j,k,l;          /* l is the letter ell, not the digit 1. */
    int mm;
    float dum;

    mm=m1+m2+1;
    l=m1;
    for (i=1;i<=m1;i++) {           /* Rearrange the storage a bit. */
        for (j=m1+2-i;j<=mm;j++) a[i][j-l]=a[i][j];
        l--;
        for (j=mm-l;j<=mm;j++) a[i][j]=0.0;
    }
    *d=1.0;
    l=m1;
    for (k=1;k<=n;k++) {            /* For each row... */
        dum=a[k][1];
        i=k;
        if (l < n) l++;
        for (j=k+1;j<=l;j++) {      /* Find the pivot element. */
            if (fabs(a[j][1]) > fabs(dum)) {
                dum=a[j][1];
                i=j;
            }
        }
        indx[k]=i;
        if (dum == 0.0) a[k][1]=TINY;
        /* Matrix is algorithmically singular, but proceed anyway with TINY pivot
           (desirable in some applications). */
        if (i != k) {               /* Interchange rows. */
            *d = -(*d);
            for (j=1;j<=mm;j++) SWAP(a[k][j],a[i][j])
        }
        for (i=k+1;i<=l;i++) {      /* Do the elimination. */
            dum=a[i][1]/a[k][1];
            al[k][i-k]=dum;
            for (j=2;j<=mm;j++) a[i][j-1]=a[i][j]-dum*a[k][j];
            a[i][mm]=0.0;
        }
    }
}
Some pivoting is possible within the storage limitations of bandec, and the above
routine does take advantage of the opportunity. In general, when TINY is returned as a
diagonal element of U, then the original matrix (perhaps as modified by roundoff error)
is in fact singular. In this regard, bandec is somewhat more robust than tridag above,
which can fail algorithmically even for nonsingular matrices; bandec is thus also useful (with
$m_1 = m_2 = 1$) for some ill-behaved tridiagonal systems.

Once the matrix A has been decomposed, any number of right-hand sides can be solved in 
turn by repeated calls to banbks, the backsubstitution routine whose analog in §2.3 is lubksb. 


#define SWAP(a,b) {dum=(a);(a)=(b);(b)=dum;}

void banbks(float **a, unsigned long n, int m1, int m2, float **al,
    unsigned long indx[], float b[])
/* Given the arrays a, al, and indx as returned from bandec, and given a right-hand side vector
b[1..n], solves the band diagonal linear equations A · x = b. The solution vector x overwrites
b[1..n]. The other input arrays are not modified, and can be left in place for successive calls
with different right-hand sides. */
{
    unsigned long i,k,l;            /* l is the letter ell, not the digit 1. */
    int mm;
    float dum;

    mm=m1+m2+1;
    l=m1;
    for (k=1;k<=n;k++) {            /* Forward substitution, unscrambling the permuted
                                       rows as we go. */
        i=indx[k];
        if (i != k) SWAP(b[k],b[i])
        if (l < n) l++;
        for (i=k+1;i<=l;i++) b[i] -= al[k][i-k]*b[k];
    }
    l=1;
    for (i=n;i>=1;i--) {            /* Backsubstitution. */
        dum=b[i];
        for (k=2;k<=l;k++) dum -= a[i][k]*b[k+i-1];
        b[i]=dum/a[i][1];
        if (l < mm) l++;
    }
}


The routines bandec and banbks are based on the Handbook routines bandet1 and
bansol1 in [1].
Figure 2.5.1. Iterative improvement of the solution to A · x = b. The first guess $\mathbf{x} + \delta\mathbf{x}$ is multiplied by
A to produce $\mathbf{b} + \delta\mathbf{b}$. The known vector b is subtracted, giving $\delta\mathbf{b}$. The linear set with this right-hand
side is inverted, giving $\delta\mathbf{x}$. This is subtracted from the first guess giving an improved solution x.


2.5 Iterative Improvement of a Solution to 
Linear Equations 

Obviously it is not easy to obtain greater precision for the solution of a linear 
set than the precision of your computer’s floating-point word. Unfortunately, for 
large sets of linear equations, it is not always easy to obtain precision equal to, or 
even comparable to, the computer’s limit. In direct methods of solution, roundoff 
errors accumulate, and they are magnified to the extent that your matrix is close 
to singular. You can easily lose two or three significant figures for matrices which 
(you thought) were far from singular. 

If this happens to you, there is a neat trick to restore the full machine precision, 
called iterative improvement of the solution. The theory is very straightforward (see 
Figure 2.5.1): Suppose that a vector x is the exact solution of the linear set 

A • x = b (2.5.1) 

You don’t, however, know x. You only know some slightly wrong solution $\mathbf{x} + \delta\mathbf{x}$,
where $\delta\mathbf{x}$ is the unknown error. When multiplied by the matrix A, your slightly wrong
solution gives a product slightly discrepant from the desired right-hand side b, namely

$$\mathbf{A}\cdot(\mathbf{x} + \delta\mathbf{x}) = \mathbf{b} + \delta\mathbf{b} \qquad (2.5.2)$$

Subtracting (2.5.1) from (2.5.2) gives 



$$\mathbf{A}\cdot\delta\mathbf{x} = \delta\mathbf{b} \qquad (2.5.3)$$
But (2.5.2) can also be solved, trivially, for $\delta\mathbf{b}$. Substituting this into (2.5.3) gives

$$\mathbf{A}\cdot\delta\mathbf{x} = \mathbf{A}\cdot(\mathbf{x} + \delta\mathbf{x}) - \mathbf{b} \qquad (2.5.4)$$

In this equation, the whole right-hand side is known, since $\mathbf{x} + \delta\mathbf{x}$ is the wrong
solution that you want to improve. It is essential to calculate the right-hand side
in double precision, since there will be a lot of cancellation in the subtraction of b.
Then, we need only solve (2.5.4) for the error $\delta\mathbf{x}$, then subtract this from the wrong
solution to get an improved solution.

An important extra benefit occurs if we obtained the original solution by LU 
decomposition. In this case we already have the LU decomposed form of A, and all 
we need do to solve (2.5.4) is compute the right-hand side and backsubstitute! 

The code to do all this is concise and straightforward: 

#include "nrutil.h"

void mprove(float **a, float **alud, int n, int indx[], float b[], float x[])
/* Improves a solution vector x[1..n] of the linear set of equations A · X = B. The matrix
a[1..n][1..n], and the vectors b[1..n] and x[1..n] are input, as is the dimension n.
Also input is alud[1..n][1..n], the LU decomposition of a as returned by ludcmp, and
the vector indx[1..n] also returned by that routine. On output, only x[1..n] is modified,
to an improved set of values. */
{
    void lubksb(float **a, int n, int *indx, float b[]);
    int j,i;
    double sdp;
    float *r;

    r=vector(1,n);
    for (i=1;i<=n;i++) {                /* Calculate the right-hand side, accumulating */
        sdp = -b[i];                    /* the residual in double precision. */
        for (j=1;j<=n;j++) sdp += a[i][j]*x[j];
        r[i]=sdp;
    }
    lubksb(alud,n,indx,r);              /* Solve for the error term, */
    for (i=1;i<=n;i++) x[i] -= r[i];    /* and subtract it from the old solution. */
    free_vector(r,1,n);
}

You should note that the routine ludcmp in §2.3 destroys the input matrix as
it LU decomposes it. Since iterative improvement requires both the original matrix
and its LU decomposition, you will need to copy A before calling ludcmp. Likewise
lubksb destroys b in obtaining x, so make a copy of b also. If you don’t mind
this extra storage, iterative improvement is highly recommended: It is a process
of order only $N^2$ operations (multiply vector by matrix, and backsubstitute; see
discussion following equation 2.3.7); it never hurts; and it can really give you your
money’s worth if it saves an otherwise ruined solution on which you have already
spent of order $N^3$ operations.

You can call mprove several times in succession if you want. Unless you are 
starting quite far from the true solution, one call is generally enough; but a second 
call to verify convergence can be reassuring. 
More on Iterative Improvement

It is illuminating (and will be useful later in the book) to give a somewhat more solid
analytical foundation for equation (2.5.4), and also to give some additional results. Implicit in
the previous discussion was the notion that the solution vector $\mathbf{x} + \delta\mathbf{x}$ has an error term; but
we neglected the fact that the LU decomposition of A is itself not exact.

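A different analytical approach starts with some matrix $\mathbf{B}_0$ that is assumed to be an
approximate inverse of the matrix A, so that $\mathbf{B}_0\cdot\mathbf{A}$ is approximately the identity matrix 1.
Define the residual matrix R of $\mathbf{B}_0$ as

$$\mathbf{R} \equiv \mathbf{1} - \mathbf{B}_0\cdot\mathbf{A} \qquad (2.5.5)$$

which is supposed to be “small” (we will be more precise below). Note that therefore

$$\mathbf{B}_0\cdot\mathbf{A} = \mathbf{1} - \mathbf{R} \qquad (2.5.6)$$

Next consider the following formal manipulation:

$$\begin{aligned}
\mathbf{A}^{-1} &= \mathbf{A}^{-1}\cdot(\mathbf{B}_0^{-1}\cdot\mathbf{B}_0) = (\mathbf{A}^{-1}\cdot\mathbf{B}_0^{-1})\cdot\mathbf{B}_0 = (\mathbf{B}_0\cdot\mathbf{A})^{-1}\cdot\mathbf{B}_0 \\
&= (\mathbf{1} - \mathbf{R})^{-1}\cdot\mathbf{B}_0 = (\mathbf{1} + \mathbf{R} + \mathbf{R}^2 + \mathbf{R}^3 + \cdots)\cdot\mathbf{B}_0
\end{aligned} \qquad (2.5.7)$$

We can define the nth partial sum of the last expression by

$$\mathbf{B}_n \equiv (\mathbf{1} + \mathbf{R} + \cdots + \mathbf{R}^n)\cdot\mathbf{B}_0 \qquad (2.5.8)$$

so that $\mathbf{B}_\infty \to \mathbf{A}^{-1}$, if the limit exists.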

It now is straightforward to verify that equation (2.5.8) satisfies some interesting
recurrence relations. As regards solving A · x = b, where x and b are vectors, define

$$\mathbf{x}_n \equiv \mathbf{B}_n\cdot\mathbf{b} \qquad (2.5.9)$$

Then it is easy to show that

$$\mathbf{x}_{n+1} = \mathbf{x}_n + \mathbf{B}_0\cdot(\mathbf{b} - \mathbf{A}\cdot\mathbf{x}_n) \qquad (2.5.10)$$

This is immediately recognizable as equation (2.5.4), with $-\delta\mathbf{x} = \mathbf{x}_{n+1} - \mathbf{x}_n$, and with $\mathbf{B}_0$
taking the role of $\mathbf{A}^{-1}$. We see, therefore, that equation (2.5.4) does not require that the LU
decomposition of A be exact, but only that the implied residual R be small. In rough terms, if
the residual is smaller than the square root of your computer’s roundoff error, then after one
application of equation (2.5.10) (that is, going from $\mathbf{x}_0 \equiv \mathbf{B}_0\cdot\mathbf{b}$ to $\mathbf{x}_1$) the first neglected term,
of order $\mathbf{R}^2$, will be smaller than the roundoff error. Equation (2.5.10), like equation (2.5.4),
moreover, can be applied more than once, since it uses only $\mathbf{B}_0$, and not any of the higher B’s.
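
The recurrence (2.5.10) is short enough to write out directly. The sketch below assumes an explicit approximate inverse b0 stored as a dense matrix, which is only one way of realizing $\mathbf{B}_0$; the function name improve_step, the zero-based row-major layout, and the use of double precision everywhere are likewise assumptions for illustration.

```c
/* One application of x <- x + B0 (b - A x), equation (2.5.10).
   a and b0 are n x n row-major; b and x have length n; r is workspace of
   length n that receives the residual b - A x. */
void improve_step(int n, const double *a, const double *b0,
                  const double *b, double *x, double *r)
{
    for (int i = 0; i < n; i++) {          /* residual r = b - A x */
        double s = b[i];
        for (int j = 0; j < n; j++) s -= a[i*n + j] * x[j];
        r[i] = s;
    }
    for (int i = 0; i < n; i++) {          /* correction x += B0 r */
        double s = 0.0;
        for (int j = 0; j < n; j++) s += b0[i*n + j] * r[j];
        x[i] += s;
    }
}
```

Each call multiplies the error in x by the residual matrix R, so for a small R a handful of calls reaches the roundoff floor.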

A much more surprising recurrence which follows from equation (2.5.8) is one that more
than doubles the order n at each stage:

$$\mathbf{B}_{2n+1} = 2\mathbf{B}_n - \mathbf{B}_n\cdot\mathbf{A}\cdot\mathbf{B}_n \qquad n = 0, 1, 3, 7, \ldots \qquad (2.5.11)$$

Repeated application of equation (2.5.11), from a suitable starting matrix $\mathbf{B}_0$, converges
quadratically to the unknown inverse matrix $\mathbf{A}^{-1}$ (see §9.4 for the definition of
“quadratically”). Equation (2.5.11) goes by various names, including Schultz’s Method and Hotelling’s
Method; see Pan and Reif [1] for references. In fact, equation (2.5.11) is simply the iterative
Newton-Raphson method of root-finding (§9.4) applied to matrix inversion.

Before you get too excited about equation (2.5.11), however, you should notice that it
involves two full matrix multiplications at each iteration. Each matrix multiplication involves
$N^3$ adds and multiplies. But we already saw in §§2.1–2.3 that direct inversion of A requires
only $N^3$ adds and $N^3$ multiplies in toto. Equation (2.5.11) is therefore practical only when
special circumstances allow it to be evaluated much more rapidly than is the case for general
matrices. We will meet such circumstances later, in §13.10.

In the spirit of delayed gratification, let us nevertheless pursue the two related issues:
When does the series in equation (2.5.7) converge; and what is a suitable initial guess $\mathbf{B}_0$ (if,
for example, an initial LU decomposition is not feasible)?


We can define the norm of a matrix as the largest amplification of length that it is
able to induce on a vector,

$$\|\mathbf{R}\| \equiv \max_{\mathbf{v} \ne 0} \frac{|\mathbf{R}\cdot\mathbf{v}|}{|\mathbf{v}|} \qquad (2.5.12)$$

If we let equation (2.5.7) act on some arbitrary right-hand side b, as one wants a matrix inverse 
to do, it is obvious that a sufficient condition for convergence is 

||R|| < 1 (2.5.13) 

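Pan and Reif [1] point out that a suitable initial guess for $\mathbf{B}_0$ is any sufficiently small constant
$\epsilon$ times the matrix transpose of A, that is,

$$\mathbf{B}_0 = \epsilon\mathbf{A}^T \qquad\text{or}\qquad \mathbf{R} = \mathbf{1} - \epsilon\mathbf{A}^T\cdot\mathbf{A} \qquad (2.5.14)$$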

To see why this is so involves concepts from Chapter 11; we give here only the briefest sketch:
$\mathbf{A}^T\cdot\mathbf{A}$ is a symmetric, positive definite matrix, so it has real, positive eigenvalues. In its
diagonal representation, R takes the form

$$\mathbf{R} = \mathrm{diag}(1 - \epsilon\lambda_1, 1 - \epsilon\lambda_2, \ldots, 1 - \epsilon\lambda_N) \qquad (2.5.15)$$

where all the $\lambda_i$’s are positive. Evidently any $\epsilon$ satisfying $0 < \epsilon < 2/(\max_i \lambda_i)$ will give
$\|\mathbf{R}\| < 1$. It is not difficult to show that the optimal choice for $\epsilon$, giving the most rapid
convergence for equation (2.5.11), is

$$\epsilon = 2/(\max_i \lambda_i + \min_i \lambda_i) \qquad (2.5.16)$$

Rarely does one know the eigenvalues of $\mathbf{A}^T\cdot\mathbf{A}$ in equation (2.5.16). Pan and Reif
derive several interesting bounds, which are computable directly from A. The following
choices guarantee the convergence of $\mathbf{B}_n$ as $n \to \infty$,

$$\epsilon \le 1 \Big/ \sum_{j,k} a_{jk}^2 \qquad\text{or}\qquad \epsilon \le 1 \Big/ \Big( \max_i \sum_j |a_{ij}| \times \max_j \sum_i |a_{ij}| \Big) \qquad (2.5.17)$$

The latter expression is truly a remarkable formula, which Pan and Reif derive by noting that
the vector norm in equation (2.5.12) need not be the usual $L_2$ norm, but can instead be either
the $L_\infty$ (max) norm, or the $L_1$ (absolute value) norm. See their work for details.
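
For a feel of how the pieces fit together, the sketch below iterates (2.5.11) starting from $\mathbf{B}_0 = \epsilon\mathbf{A}^T$ with $\epsilon$ taken as the reciprocal of the sum of squared elements, the first choice in (2.5.17). The names matmul and schulz_inverse, the fixed iteration count, and the dense row-major double-precision storage are all assumptions of this illustration; as noted above, the two matrix multiplications per step make this uncompetitive for general matrices.

```c
#include <stdlib.h>

/* out = p * q for n x n row-major matrices (out must not alias p or q). */
static void matmul(int n, const double *p, const double *q, double *out)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++) s += p[i*n + k] * q[k*n + j];
            out[i*n + j] = s;
        }
}

/* Approximate the inverse of a (n x n) in b by iterating B <- 2B - B A B,
   starting from B0 = eps * A^T with eps = 1 / (sum of squares of a).
   Returns 0 on success, -1 if a is all zeros or allocation fails. */
int schulz_inverse(int n, const double *a, double *b, int iters)
{
    size_t count = (size_t)n * (size_t)n;
    double fro2 = 0.0;
    for (size_t i = 0; i < count; i++) fro2 += a[i] * a[i];
    if (fro2 == 0.0) return -1;

    double *t = malloc(count * sizeof *t);
    double *u = malloc(count * sizeof *u);
    if (t == NULL || u == NULL) { free(t); free(u); return -1; }

    for (int i = 0; i < n; i++)            /* B0 = eps * A^T */
        for (int j = 0; j < n; j++) b[i*n + j] = a[j*n + i] / fro2;

    for (int it = 0; it < iters; it++) {
        matmul(n, a, b, t);                /* t = A B   */
        matmul(n, b, t, u);                /* u = B A B */
        for (size_t i = 0; i < count; i++) b[i] = 2.0 * b[i] - u[i];
    }
    free(t);
    free(u);
    return 0;
}
```

Because the residual is squared at every step, the number of iterations needed grows only logarithmically with the initial conditioning; a production version would test the size of the residual rather than use a fixed count.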

Another approach, with which we have had some success, is to estimate the largest
eigenvalue statistically, by calculating $s_i \equiv |\mathbf{A}\cdot\mathbf{v}_i|^2$ for several unit vector $\mathbf{v}_i$’s with randomly
chosen directions in N-space. The largest eigenvalue $\lambda$ can then be bounded by the maximum
of $2\max s_i$ and $2N\,\mathrm{Var}(s_i)/\mu(s_i)$, where Var and $\mu$ denote the sample variance and mean,
respectively.
2.6 Singular Value Decomposition

There exists a very powerful set of techniques for dealing with sets of equations 
or matrices that are either singular or else numerically very close to singular. In many 
cases where Gaussian elimination and LU decomposition fail to give satisfactory 
results, this set of techniques, known as singular value decomposition, or SVD, 
will diagnose for you precisely what the problem is. In some cases, SVD will 
not only diagnose the problem, it will also solve it, in the sense of giving you a 
useful numerical answer, although, as we shall see, not necessarily “the” answer 
that you thought you should get. 

SVD is also the method of choice for solving most linear least-squares problems. 
We will outline the relevant theory in this section, but defer detailed discussion of 
the use of SVD in this application to Chapter 15, whose subject is the parametric 
modeling of data. 

SVD methods are based on the following theorem of linear algebra, whose proof 
is beyond our scope: Any M x N matrix A whose number of rows M is greater than 
or equal to its number of columns N, can be written as the product of an M x N 
column-orthogonal matrix U, an N x N diagonal matrix W with positive or zero 
elements (the singular values ), and the transpose of an N x N orthogonal matrix V. 
The various shapes of these matrices will be made clearer by the following tableau: 


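$$\underbrace{\mathbf{A}}_{M \times N} = \underbrace{\mathbf{U}}_{M \times N} \cdot \underbrace{\mathrm{diag}(w_1, w_2, \ldots, w_N)}_{N \times N} \cdot \underbrace{\mathbf{V}^T}_{N \times N} \qquad (2.6.1)$$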


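The matrices U and V are each orthogonal in the sense that their columns are
orthonormal,

$$\sum_{i=1}^{M} U_{ik} U_{in} = \delta_{kn} \qquad 1 \le k \le N, \quad 1 \le n \le N \qquad (2.6.2)$$

$$\sum_{j=1}^{N} V_{jk} V_{jn} = \delta_{kn} \qquad 1 \le k \le N, \quad 1 \le n \le N \qquad (2.6.3)$$

or, in matrix form,

$$\mathbf{U}^T\cdot\mathbf{U} = \mathbf{V}^T\cdot\mathbf{V} = \mathbf{1} \qquad (2.6.4)$$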
Since V is square, it is also row-orthonormal, $\mathbf{V}\cdot\mathbf{V}^T = \mathbf{1}$.

The SVD decomposition can also be carried out when M < N. In this case
the singular values $w_j$ for $j = M + 1, \ldots, N$ are all zero, and the corresponding
columns of U are also zero. Equation (2.6.2) then holds only for $k, n \le M$.

The decomposition (2.6.1) can always be done, no matter how singular the
matrix is, and it is “almost” unique. That is to say, it is unique up to (i) making
the same permutation of the columns of U, elements of W, and columns of V (or
rows of $\mathbf{V}^T$), or (ii) forming linear combinations of any columns of U and V whose
corresponding elements of W happen to be exactly equal. An important consequence
of the permutation freedom is that for the case M < N, a numerical algorithm for
the decomposition need not return zero $w_j$’s for $j = M + 1, \ldots, N$; the N − M
zero singular values can be scattered among all positions $j = 1, 2, \ldots, N$.

At the end of this section, we give a routine, svdcmp, that performs SVD on 
an arbitrary matrix A, replacing it by U (they are the same shape) and giving back 
W and V separately. The routine svdcmp is based on a routine by Forsythe et 
al. [1], which is in turn based on the original routine of Golub and Reinsch, found, in
various forms, in [2-4] and elsewhere. These references include extensive discussion 
of the algorithm used. As much as we dislike the use of black-box routines, we are 
going to ask you to accept this one, since it would take us too far afield to cover 
its necessary background material here. Suffice it to say that the algorithm is very 
stable, and that it is very unusual for it ever to misbehave. Most of the concepts that 
enter the algorithm (Householder reduction to bidiagonal form, diagonalization by 
QR procedure with shifts) will be discussed further in Chapter 11. 

If you are as suspicious of black boxes as we are, you will want to verify yourself 
that svdcmp does what we say it does. That is very easy to do: Generate an arbitrary 
matrix A, call the routine, and then verify by matrix multiplication that (2.6.1) and 
(2.6.4) are satisfied. Since these two equations are the only defining requirements 
for SVD, this procedure is (for the chosen A) a complete end-to-end check. 

Now let us find out what SVD is good for. 
If the matrix A is square, $N \times N$ say, then U, V, and W are all square matrices
of the same size. Their inverses are also trivial to compute: U and V are orthogonal,
so their inverses are equal to their transposes; W is diagonal, so its inverse is the
diagonal matrix whose elements are the reciprocals of the elements $w_j$. From (2.6.1)
it now follows immediately that the inverse of A is

$$\mathbf{A}^{-1} = \mathbf{V}\cdot[\mathrm{diag}(1/w_j)]\cdot\mathbf{U}^T \qquad (2.6.5)$$

The only thing that can go wrong with this construction is for one of the $w_j$’s
to be zero, or (numerically) for it to be so small that its value is dominated by
roundoff error and therefore unknowable. If more than one of the $w_j$’s have this
problem, then the matrix is even more singular. So, first of all, SVD gives you a
clear diagnosis of the situation.

Formally, the condition number of a matrix is defined as the ratio of the largest
(in magnitude) of the $w_j$’s to the smallest of the $w_j$’s. A matrix is singular if its
condition number is infinite, and it is ill-conditioned if its condition number is too
large, that is, if its reciprocal approaches the machine’s floating-point precision (for
example, less than $10^{-6}$ for single precision or $10^{-12}$ for double).
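
Given the singular values, this diagnosis takes only a few lines. In the sketch below the names condition_number and is_ill_conditioned are hypothetical, w is a zero-based array of n ≥ 1 singular values, and the threshold is left to the caller (for instance 1e-6 or 1e-12, following the rule of thumb just stated).

```c
#include <math.h>

/* Ratio of the largest to the smallest singular value (in magnitude).
   Returns HUGE_VAL if the smallest one is exactly zero (singular matrix). */
double condition_number(const double *w, int n)
{
    double wmax = 0.0, wmin = HUGE_VAL;
    for (int j = 0; j < n; j++) {
        double v = fabs(w[j]);
        if (v > wmax) wmax = v;
        if (v < wmin) wmin = v;
    }
    return (wmin == 0.0) ? HUGE_VAL : wmax / wmin;
}

/* Nonzero if the reciprocal condition number falls below threshold. */
int is_ill_conditioned(double cond, double threshold)
{
    return 1.0 / cond < threshold;
}
```

Note that an exactly singular matrix is reported as ill-conditioned for any positive threshold, since the reciprocal of an infinite condition number is zero.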

For singular matrices, the concepts of nullspace and range are important. 
Consider the familiar set of simultaneous equations 

$$\mathbf{A}\cdot\mathbf{x} = \mathbf{b} \qquad (2.6.6)$$

where A is a square matrix, b and x are vectors. Equation (2.6.6) defines A as a 
linear mapping from the vector space x to the vector space b. If A is singular, then 
there is some subspace of x, called the nullspace, that is mapped to zero, A • x = 0. 
The dimension of the nullspace (the number of linearly independent vectors x that 
can be found in it) is called the nullity of A. 

Now, there is also some subspace of b that can be “reached” by A, in the sense 
that there exists some x which is mapped there. This subspace of b is called the range 
of A. The dimension of the range is called the rank of A. If A is nonsingular, then its 
range will be all of the vector space b, so its rank is N. If A is singular, then the rank 
will be less than N. In fact, the relevant theorem is “rank plus nullity equals N.” 

What has this to do with SVD? SVD explicitly constructs orthonormal bases 
for the nullspace and range of a matrix. Specifically, the columns of U whose 
same-numbered elements $w_j$ are nonzero are an orthonormal set of basis vectors that 
span the range; the columns of V whose same-numbered elements $w_j$ are zero are 
an orthonormal basis for the nullspace. 

Now let’s have another look at solving the set of simultaneous linear equations 
(2.6.6) in the case that A is singular. First, the set of homogeneous equations, where 
b = 0, is solved immediately by SVD: Any column of V whose corresponding $w_j$ 
is zero yields a solution. 

When the vector b on the right-hand side is not zero, the important question is 
whether it lies in the range of A or not. If it does, then the singular set of equations 
does have a solution x; in fact it has more than one solution, since any vector in 
the nullspace (any column of V with a corresponding zero $w_j$) can be added to x 
in any linear combination. 

If we want to single out one particular member of this solution-set of vectors as 
a representative, we might want to pick the one with the smallest length $|\mathbf{x}|^2$. Here is 
how to find that vector using SVD: Simply replace $1/w_j$ by zero if $w_j = 0$. (It is not 
very often that one gets to set $\infty = 0$!) Then compute (working from right to left) 

$$\mathbf{x} = \mathbf{V} \cdot [\mathrm{diag}(1/w_j)] \cdot (\mathbf{U}^T \cdot \mathbf{b}) \qquad (2.6.7)$$

This will be the solution vector of smallest length; the columns of V that are in the 
nullspace complete the specification of the solution set. 

Proof: Consider $|\mathbf{x} + \mathbf{x}'|$, where x' lies in the nullspace. Then, if $\mathbf{W}^{-1}$ denotes 
the modified inverse of W with some elements zeroed, 

$$
\begin{aligned}
|\mathbf{x} + \mathbf{x}'| &= |\mathbf{V} \cdot \mathbf{W}^{-1} \cdot \mathbf{U}^T \cdot \mathbf{b} + \mathbf{x}'| \\
&= |\mathbf{V} \cdot (\mathbf{W}^{-1} \cdot \mathbf{U}^T \cdot \mathbf{b} + \mathbf{V}^T \cdot \mathbf{x}')| \\
&= |\mathbf{W}^{-1} \cdot \mathbf{U}^T \cdot \mathbf{b} + \mathbf{V}^T \cdot \mathbf{x}'|
\end{aligned}
\qquad (2.6.8)
$$

Here the first equality follows from (2.6.7), the second and third from the 
orthonormality of V. If you now examine the two terms that make up the sum on the 
right-hand side, you will see that the first one has nonzero j components only where 
$w_j \neq 0$, while the second one, since x' is in the nullspace, has nonzero j components 
only where $w_j = 0$. Therefore the minimum length obtains for x' = 0, q.e.d. 

If b is not in the range of the singular matrix A, then the set of equations (2.6.6) 
has no solution. But here is some good news: If b is not in the range of A, then 
equation (2.6.7) can still be used to construct a “solution” vector x. This vector x 
will not exactly solve A x = b. But, among all possible vectors x, it will do the 
closest possible job in the least squares sense. In other words (2.6.7) finds 

$$\mathbf{x} \text{ which minimizes } r \equiv |\mathbf{A} \cdot \mathbf{x} - \mathbf{b}| \qquad (2.6.9)$$

The number r is called the residual of the solution. 

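The proof is similar to (2.6.8): Suppose we modify x by adding some arbitrary 
x'. Then A • x − b is modified by adding some b' ≡ A • x'. Obviously b' is in 
the range of A. We then have 

$$
\begin{aligned}
|\mathbf{A} \cdot \mathbf{x} - \mathbf{b} + \mathbf{b}'| &= |(\mathbf{U} \cdot \mathbf{W} \cdot \mathbf{V}^T) \cdot (\mathbf{V} \cdot \mathbf{W}^{-1} \cdot \mathbf{U}^T \cdot \mathbf{b}) - \mathbf{b} + \mathbf{b}'| \\
&= |(\mathbf{U} \cdot \mathbf{W} \cdot \mathbf{W}^{-1} \cdot \mathbf{U}^T - \mathbf{1}) \cdot \mathbf{b} + \mathbf{b}'| \\
&= |\mathbf{U} \cdot [(\mathbf{W} \cdot \mathbf{W}^{-1} - \mathbf{1}) \cdot \mathbf{U}^T \cdot \mathbf{b} + \mathbf{U}^T \cdot \mathbf{b}']| \\
&= |(\mathbf{W} \cdot \mathbf{W}^{-1} - \mathbf{1}) \cdot \mathbf{U}^T \cdot \mathbf{b} + \mathbf{U}^T \cdot \mathbf{b}'|
\end{aligned}
\qquad (2.6.10)
$$

Now, $(\mathbf{W} \cdot \mathbf{W}^{-1} - \mathbf{1})$ is a diagonal matrix which has nonzero j components only for 
$w_j = 0$, while $\mathbf{U}^T \cdot \mathbf{b}'$ has nonzero j components only for $w_j \neq 0$, since b' lies in the 
range of A. Therefore the minimum obtains for b' = 0, q.e.d. 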
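
A small worked case may make the two proofs concrete. The sketch below evaluates equation (2.6.7) with $1/w_j$ replaced by zero wherever $w_j = 0$. It is an illustration under stated assumptions, not the book's routine: the function name is hypothetical, arrays are 0-based, row-major doubles, and the caller supplies the scratch vector.

```c
#include <assert.h>

/* Evaluates x = V . diag(1/w_j) . (U^T . b), treating 1/w_j as zero
   wherever w_j == 0. u is m-by-n and v is n-by-n, both row-major and
   0-based (assumptions of this sketch); tmp is scratch of length n. */
void svd_apply_pinv(const double *u, const double *w, const double *v,
                    int m, int n, const double *b, double *x, double *tmp)
{
    int i, j;

    assert(m > 0 && n > 0);
    for (j = 0; j < n; j++) {            /* tmp = diag(1/w_j) . U^T . b */
        double s = 0.0;
        if (w[j] != 0.0) {
            for (i = 0; i < m; i++) s += u[i * n + j] * b[i];
            s /= w[j];
        }
        tmp[j] = s;
    }
    for (i = 0; i < n; i++) {            /* x = V . tmp */
        double s = 0.0;
        for (j = 0; j < n; j++) s += v[i * n + j] * tmp[j];
        x[i] = s;
    }
}
```

For the singular matrix with all four elements equal to 1, one valid decomposition has U = V with rows $(1,1)/\sqrt{2}$ and $(1,-1)/\sqrt{2}$, and singular values 2 and 0. With b = (2, 2), which lies in the range, the routine returns x = (1, 1), the shortest of the solutions of $x_1 + x_2 = 2$. With b = (1, 3), which does not lie in the range, it again returns x = (1, 1), the least-squares compromise.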

Figure 2.6.1 summarizes our discussion of SVD thus far. 

Figure 2.6.1. (a) A nonsingular matrix A maps a vector space into one of the same dimension. The 
vector x is mapped into b, so that x satisfies the equation A • x = b. (b) A singular matrix A maps a 
vector space into one of lower dimensionality, here a plane into a line, called the “range” of A. The 
“nullspace” of A is mapped to zero. The solutions of A • x = d consist of any one particular solution plus 
any vector in the nullspace, here forming a line parallel to the nullspace. Singular value decomposition 
(SVD) selects the particular solution closest to zero, as shown. The point c lies outside of the range 
of A, so A • x = c has no solution. SVD finds the least-squares best compromise solution, namely a 
solution of A • x = c', as shown. 


In the discussion since equation (2.6.6), we have been pretending that a matrix 
either is singular or else isn’t. That is of course true analytically. Numerically, 
however, the far more common situation is that some of the $w_j$’s are very small 
but nonzero, so that the matrix is ill-conditioned. In that case, the direct solution 
methods of LU decomposition or Gaussian elimination may actually give a formal 
solution to the set of equations (that is, a zero pivot may not be encountered); but 
the solution vector may have wildly large components whose algebraic cancellation, 
when multiplying by the matrix A, may give a very poor approximation to the 
right-hand vector b. In such cases, the solution vector x obtained by zeroing the 
small $w_j$’s and then using equation (2.6.7) is very often better (in the sense of the 
residual $|\mathbf{A} \cdot \mathbf{x} - \mathbf{b}|$ being smaller) than both the direct-method solution and the SVD 
solution where the small $w_j$’s are left nonzero. 

It may seem paradoxical that this can be so, since zeroing a singular value 
corresponds to throwing away one linear combination of the set of equations that 
we are trying to solve. The resolution of the paradox is that we are throwing away 
precisely a combination of equations that is so corrupted by roundoff error as to be at 
best useless; usually it is worse than useless since it “pulls” the solution vector way 
off towards infinity along some direction that is almost a nullspace vector. In doing 
this, it compounds the roundoff problem and makes the residual $|\mathbf{A} \cdot \mathbf{x} - \mathbf{b}|$ larger. 

SVD cannot be applied blindly, then. You have to exercise some discretion in 
deciding at what threshold to zero the small $w_j$’s, and/or you have to have some idea 
what size of computed residual $|\mathbf{A} \cdot \mathbf{x} - \mathbf{b}|$ is acceptable. 

As an example, here is a “backsubstitution” routine svbksb for evaluating 
equation (2.6.7) and obtaining a solution vector x from a right-hand side b, given 
that the SVD of a matrix A has already been calculated by a call to svdcmp. Note 
that this routine presumes that you have already zeroed the small $w_j$’s. It does not 
do this for you. If you haven’t zeroed the small $w_j$’s, then this routine is just as 
ill-conditioned as any direct method, and you are misusing SVD. 

#include "nrutil.h"

void svbksb(float **u, float w[], float **v, int m, int n, float b[], float x[])
/* Solves A.X = B for a vector X, where A is specified by the arrays u[1..m][1..n], w[1..n],
   v[1..n][1..n] as returned by svdcmp. m and n are the dimensions of a, and will be equal for
   square matrices. b[1..m] is the input right-hand side. x[1..n] is the output solution vector.
   No input quantities are destroyed, so the routine may be called sequentially with different b's. */
{
    int jj,j,i;
    float s,*tmp;

    tmp=vector(1,n);
    for (j=1;j<=n;j++) {                 /* Calculate U^T B. */
        s=0.0;
        if (w[j]) {                      /* Nonzero result only if w_j is nonzero. */
            for (i=1;i<=m;i++) s += u[i][j]*b[i];
            s /= w[j];                   /* This is the divide by w_j. */
        }
        tmp[j]=s;
    }
    for (j=1;j<=n;j++) {                 /* Matrix multiply by V to get answer. */
        s=0.0;
        for (jj=1;jj<=n;jj++) s += v[j][jj]*tmp[jj];
        x[j]=s;
    }
    free_vector(tmp,1,n);
}



Note that a typical use of svdcmp and svbksb superficially resembles the 
typical use of ludcmp and lubksb: In both cases, you decompose the left-hand 
matrix A just once, and then can use the decomposition either once or many times 
with different right-hand sides. The crucial difference is the “editing” of the singular 
values before svbksb is called: 

#define N ...

float wmax,wmin,**a,**u,*w,**v,*b,*x;
int i,j;

for (i=1;i<=N;i++)                /* Copy a into u if you don't want it to be destroyed. */
    for (j=1;j<=N;j++)
        u[i][j]=a[i][j];
svdcmp(u,N,N,w,v);                /* SVD the square matrix a. */
wmax=0.0;                         /* Will be the maximum singular value obtained. */
for (j=1;j<=N;j++) if (w[j] > wmax) wmax=w[j];
/* This is where we set the threshold for singular values allowed to be nonzero. The constant
   is typical, but not universal. You have to experiment with your own application. */
wmin=wmax*1.0e-6;
for (j=1;j<=N;j++) if (w[j] < wmin) w[j]=0.0;
svbksb(u,w,v,N,N,b,x);            /* Now we can backsubstitute. */


SVD for Fewer Equations than Unknowns 


If you have fewer linear equations M than unknowns N, then you are not 
expecting a unique solution. Usually there will be an N − M dimensional family 
of solutions. If you want to find this whole solution space, then SVD can readily 
do the job. 

The SVD decomposition will yield N − M zero or negligible $w_j$’s, since 
M < N. There may be additional zero $w_j$’s from any degeneracies in your M 
equations. Be sure that you find this many small $w_j$’s, and zero them before calling 
svbksb, which will give you the particular solution vector x. As before, the columns 
of V corresponding to zeroed $w_j$’s are the basis vectors whose linear combinations, 
added to the particular solution, span the solution space. 


SVD for More Equations than Unknowns 


This situation will occur in Chapter 15, when we wish to find the least-squares 
solution to an overdetermined set of linear equations. In tableau, the equations 
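to be solved are 

$$
\begin{pmatrix} \\ \\ \mathbf{A} \\ \\ \\ \end{pmatrix} \cdot \begin{pmatrix} \\ \mathbf{x} \\ \\ \end{pmatrix} = \begin{pmatrix} \\ \\ \mathbf{b} \\ \\ \\ \end{pmatrix} \qquad (2.6.11)
$$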



The proofs that we gave above for the square case apply without modification 
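to the case of more equations than unknowns. The least-squares solution vector x is 
given by (2.6.7), which, with nonsquare matrices, looks like this: 

$$
\begin{pmatrix} \\ \mathbf{x} \\ \\ \end{pmatrix} = \begin{pmatrix} & & \\ & \mathbf{V} & \\ & & \end{pmatrix} \cdot \begin{pmatrix} & & \\ & \mathrm{diag}(1/w_j) & \\ & & \end{pmatrix} \cdot \begin{pmatrix} & & & & \\ & & \mathbf{U}^T & & \\ & & & & \end{pmatrix} \cdot \begin{pmatrix} \\ \\ \mathbf{b} \\ \\ \\ \end{pmatrix} \qquad (2.6.12)
$$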


In general, the matrix W will not be singular, and no $w_j$’s will need to be 
set to zero. Occasionally, however, there might be column degeneracies in A. In 
this case you will need to zero some small $w_j$ values after all. The corresponding 
column in V gives the linear combination of x’s that is then ill-determined even by 
the supposedly overdetermined set. 

Sometimes, although you do not need to zero any $w_j$’s for computational 
reasons, you may nevertheless want to take note of any that are unusually small: 
Their corresponding columns in V are linear combinations of x’s which are insensitive 
to your data. In fact, you may then wish to zero these $w_j$’s, to reduce the number of 
free parameters in the fit. These matters are discussed more fully in Chapter 15. 

Constructing an Orthonormal Basis 

Suppose that you have N vectors in an M-dimensional vector space, with 
N < M. Then the N vectors span some subspace of the full vector space. 
Often you want to construct an orthonormal set of N vectors that span the same 
subspace. The textbook way to do this is by Gram-Schmidt orthogonalization, 
starting with one vector and then expanding the subspace one dimension at a 
time. Numerically, however, because of the build-up of roundoff errors, naive 
Gram-Schmidt orthogonalization is terrible. 

The right way to construct an orthonormal basis for a subspace is by SVD: 
Form an M x N matrix A whose N columns are your vectors. Run the matrix 
through svdcmp. The columns of the matrix U (which in fact replaces A on output 
from svdcmp) are your desired orthonormal basis vectors. 

You might also want to check the output $w_j$’s for zero values. If any occur, 
then the spanned subspace was not, in fact, N dimensional; the columns of U 
corresponding to zero $w_j$’s should be discarded from the orthonormal basis set. 

(QR factorization, discussed in §2.10, also constructs an orthonormal basis, 
see [5].) 

Approximation of Matrices 


Note that equation (2.6.1) can be rewritten to express any matrix $A_{ij}$ as a sum 
of outer products of columns of U and rows of $\mathbf{V}^T$, with the “weighting factors” 
being the singular values $w_j$, 

$$A_{ij} = \sum_{k=1}^{N} w_k U_{ik} V_{jk} \qquad (2.6.13)$$

If you ever encounter a situation where most of the singular values $w_j$ of a 
matrix A are very small, then A will be well-approximated by only a few terms in the 
sum (2.6.13). This means that you have to store only a few columns of U and V (the 
same k ones) and you will be able to recover, with good accuracy, the whole matrix. 

Note also that it is very efficient to multiply such an approximated matrix by a 
vector x: You just dot x with each of the stored columns of V, multiply the resulting 
scalar by the corresponding $w_k$, and accumulate that multiple of the corresponding 
column of U. If your matrix is approximated by a small number K of singular 
values, then this computation of A • x takes only about K(M + N) multiplications, 
instead of MN for the full matrix. 
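
The procedure just described can be sketched as follows. The function name and the storage layout (each retained column of U and V stored contiguously, 0-based doubles) are assumptions made for this illustration; they are not the conventions of svdcmp.

```c
#include <assert.h>

/* Accumulates y = sum over k < K of w[k] * (v_k . x) * u_k, where u_k and
   v_k are the k-th stored columns of U (length m) and V (length n).
   Assumed layout: ucols[k*m + i] and vcols[k*n + j], 0-based. */
void lowrank_matvec(const double *ucols, const double *w, const double *vcols,
                    int m, int n, int K, const double *x, double *y)
{
    int i, j, k;

    assert(m > 0 && n > 0 && K >= 0);
    for (i = 0; i < m; i++) y[i] = 0.0;
    for (k = 0; k < K; k++) {
        double dot = 0.0;
        for (j = 0; j < n; j++) dot += vcols[k * n + j] * x[j];  /* n multiplies */
        dot *= w[k];
        for (i = 0; i < m; i++) y[i] += dot * ucols[k * m + i];  /* m multiplies */
    }
}
```

Each retained singular value costs one dot product of length N and one scaled accumulation of length M, which is where the count of about K(M + N) multiplications comes from.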

SVD Algorithm 

Here is the algorithm for constructing the singular value decomposition of any 
matrix. See §11.2–§11.3, and also [4-5], for discussion relating to the underlying 
method. 

#include <math.h>
#include "nrutil.h"

void svdcmp(float **a, int m, int n, float w[], float **v)
/* Given a matrix a[1..m][1..n], this routine computes its singular value decomposition,
   A = U.W.V^T. The matrix U replaces a on output. The diagonal matrix of singular values W
   is output as a vector w[1..n]. The matrix V (not the transpose V^T) is output as
   v[1..n][1..n]. */
{
    float pythag(float a, float b);
    int flag,i,its,j,jj,k,l,nm;
    float anorm,c,f,g,h,s,scale,x,y,z,*rv1;

    rv1=vector(1,n);
    g=scale=anorm=0.0;                   /* Householder reduction to bidiagonal form. */
    for (i=1;i<=n;i++) {
        l=i+1;
        rv1[i]=scale*g;
        g=s=scale=0.0;
        if (i <= m) {
            for (k=i;k<=m;k++) scale += fabs(a[k][i]);
            if (scale) {
                for (k=i;k<=m;k++) {
                    a[k][i] /= scale;
                    s += a[k][i]*a[k][i];
                }
                f=a[i][i];
                g = -SIGN(sqrt(s),f);
                h=f*g-s;
                a[i][i]=f-g;
                for (j=l;j<=n;j++) {
                    for (s=0.0,k=i;k<=m;k++) s += a[k][i]*a[k][j];
                    f=s/h;
                    for (k=i;k<=m;k++) a[k][j] += f*a[k][i];
                }
                for (k=i;k<=m;k++) a[k][i] *= scale;
            }
        }
        w[i]=scale*g;
        g=s=scale=0.0;
        if (i <= m && i != n) {
            for (k=l;k<=n;k++) scale += fabs(a[i][k]);
            if (scale) {
                for (k=l;k<=n;k++) {
                    a[i][k] /= scale;
                    s += a[i][k]*a[i][k];
                }
                f=a[i][l];
                g = -SIGN(sqrt(s),f);
                h=f*g-s;
                a[i][l]=f-g;
                for (k=l;k<=n;k++) rv1[k]=a[i][k]/h;
                for (j=l;j<=m;j++) {
                    for (s=0.0,k=l;k<=n;k++) s += a[j][k]*a[i][k];
                    for (k=l;k<=n;k++) a[j][k] += s*rv1[k];
                }
                for (k=l;k<=n;k++) a[i][k] *= scale;
            }
        }
        anorm=FMAX(anorm,(fabs(w[i])+fabs(rv1[i])));
    }
    for (i=n;i>=1;i--) {                 /* Accumulation of right-hand transformations. */
        if (i < n) {
            if (g) {
                for (j=l;j<=n;j++)       /* Double division to avoid possible underflow. */
                    v[j][i]=(a[i][j]/a[i][l])/g;
                for (j=l;j<=n;j++) {
                    for (s=0.0,k=l;k<=n;k++) s += a[i][k]*v[k][j];
                    for (k=l;k<=n;k++) v[k][j] += s*v[k][i];
                }
            }
            for (j=l;j<=n;j++) v[i][j]=v[j][i]=0.0;
        }
        v[i][i]=1.0;
        g=rv1[i];
        l=i;
    }
    for (i=IMIN(m,n);i>=1;i--) {         /* Accumulation of left-hand transformations. */
        l=i+1;
        g=w[i];
        for (j=l;j<=n;j++) a[i][j]=0.0;
        if (g) {
            g=1.0/g;
            for (j=l;j<=n;j++) {
                for (s=0.0,k=l;k<=m;k++) s += a[k][i]*a[k][j];
                f=(s/a[i][i])*g;
                for (k=i;k<=m;k++) a[k][j] += f*a[k][i];
            }
            for (j=i;j<=m;j++) a[j][i] *= g;
        } else for (j=i;j<=m;j++) a[j][i]=0.0;
        ++a[i][i];
    }
    for (k=n;k>=1;k--) {                 /* Diagonalization of the bidiagonal form: Loop over */
        for (its=1;its<=30;its++) {      /* singular values, and over allowed iterations. */
            flag=1;
            for (l=k;l>=1;l--) {         /* Test for splitting. */
                nm=l-1;                  /* Note that rv1[1] is always zero. */
                if ((float)(fabs(rv1[l])+anorm) == anorm) {
                    flag=0;
                    break;
                }
                if ((float)(fabs(w[nm])+anorm) == anorm) break;
            }
            if (flag) {
                c=0.0;                   /* Cancellation of rv1[l], if l > 1. */
                s=1.0;
                for (i=l;i<=k;i++) {
                    f=s*rv1[i];
                    rv1[i]=c*rv1[i];
                    if ((float)(fabs(f)+anorm) == anorm) break;
                    g=w[i];
                    h=pythag(f,g);
                    w[i]=h;
                    h=1.0/h;
                    c=g*h;
                    s = -f*h;
                    for (j=1;j<=m;j++) {
                        y=a[j][nm];
                        z=a[j][i];
                        a[j][nm]=y*c+z*s;
                        a[j][i]=z*c-y*s;
                    }
                }
            }
            z=w[k];
            if (l == k) {                /* Convergence. */
                if (z < 0.0) {           /* Singular value is made nonnegative. */
                    w[k] = -z;
                    for (j=1;j<=n;j++) v[j][k] = -v[j][k];
                }
                break;
            }
            if (its == 30) nrerror("no convergence in 30 svdcmp iterations");
            x=w[l];                      /* Shift from bottom 2-by-2 minor. */
            nm=k-1;
            y=w[nm];
            g=rv1[nm];
            h=rv1[k];
            f=((y-z)*(y+z)+(g-h)*(g+h))/(2.0*h*y);
            g=pythag(f,1.0);
            f=((x-z)*(x+z)+h*((y/(f+SIGN(g,f)))-h))/x;
            c=s=1.0;                     /* Next QR transformation: */
            for (j=l;j<=nm;j++) {
                i=j+1;
                g=rv1[i];
                y=w[i];
                h=s*g;
                g=c*g;
                z=pythag(f,h);
                rv1[j]=z;
                c=f/z;
                s=h/z;
                f=x*c+g*s;
                g = g*c-x*s;
                h=y*s;
                y *= c;
                for (jj=1;jj<=n;jj++) {
                    x=v[jj][j];
                    z=v[jj][i];
                    v[jj][j]=x*c+z*s;
                    v[jj][i]=z*c-x*s;
                }
                z=pythag(f,h);
                w[j]=z;                  /* Rotation can be arbitrary if z = 0. */
                if (z) {
                    z=1.0/z;
                    c=f*z;
                    s=h*z;
                }
                f=c*g+s*y;
                x=c*y-s*g;
                for (jj=1;jj<=m;jj++) {
                    y=a[jj][j];
                    z=a[jj][i];
                    a[jj][j]=y*c+z*s;
                    a[jj][i]=z*c-y*s;
                }
            }
            rv1[l]=0.0;
            rv1[k]=f;
            w[k]=x;
        }
    }
    free_vector(rv1,1,n);
}


#include <math.h>
#include "nrutil.h"

float pythag(float a, float b)
/* Computes (a^2 + b^2)^(1/2) without destructive underflow or overflow. */
{
    float absa,absb;

    absa=fabs(a);
    absb=fabs(b);
    if (absa > absb) return absa*sqrt(1.0+SQR(absb/absa));
    else return (absb == 0.0 ? 0.0 : absb*sqrt(1.0+SQR(absa/absb)));
}


(Double precision versions of svdcmp, svbksb, and pythag, named dsvdcmp, 
dsvbksb, and dpythag, are used by the routine ratlsq in §5.13. You can easily 
make the conversions, or else get the converted routines from the Numerical Recipes 
diskette.) 

2.7 Sparse Linear Systems 

A system of linear equations is called sparse if only a relatively small number 
of its matrix elements $a_{ij}$ are nonzero. It is wasteful to use general methods of 
linear algebra on such problems, because most of the $O(N^3)$ arithmetic operations 
devoted to solving the set of equations or inverting the matrix involve zero operands. 
Furthermore, you might wish to work problems so large as to tax your available 
memory space, and it is wasteful to reserve storage for unfruitful zero elements. 
Note that there are two distinct (and not always compatible) goals for any sparse 
matrix method: saving time and/or saving space. 

We have already considered one archetypal sparse form in §2.4, the band 
diagonal matrix. In the tridiagonal case, e.g., we saw that it was possible to save 
both time (order N instead of $N^3$) and space (order N instead of $N^2$). The 
method of solution was not different in principle from the general method of LU 
decomposition; it was just applied cleverly, and with due attention to the bookkeeping 
of zero elements. Many practical schemes for dealing with sparse problems have this 
same character. They are fundamentally decomposition schemes, or else elimination 
schemes akin to Gauss-Jordan, but carefully optimized so as to minimize the number 
of so-called fill-ins, initially zero elements which must become nonzero during the 
solution process, and for which storage must be reserved. 

Direct methods for solving sparse equations, then, depend crucially on the 
precise pattern of sparsity of the matrix. Patterns that occur frequently, or that are 
useful as way-stations in the reduction of more general forms, already have special 
names and special methods of solution. We do not have space here for any detailed 
review of these. References listed at the end of this section will furnish you with an 
“in” to the specialized literature, and the following list of buzz words (and Figure 
2.7.1) will at least let you hold your own at cocktail parties: 

• tridiagonal 

• band diagonal (or banded) with bandwidth M 

• band triangular 

• block diagonal 

• block tridiagonal 

• block triangular 

• cyclic banded 

• singly (or doubly) bordered block diagonal 

• singly (or doubly) bordered block triangular 

• singly (or doubly) bordered band diagonal 

• singly (or doubly) bordered band triangular 

• other (!) 

You should also be aware of some of the special sparse forms that occur in the 
solution of partial differential equations in two or more dimensions. See Chapter 19. 

If your particular pattern of sparsity is not a simple one, then you may wish to 
try an analyze/factorize/operate package, which automates the procedure of figuring 
out how fill-ins are to be minimized. The analyze stage is done once only for each 
pattern of sparsity. The factorize stage is done once for each particular matrix that 
fits the pattern. The operate stage is performed once for each right-hand side to 
be used with the particular matrix. Consult [2,3] for references on this. The NAG 
library [4] has an analyze/factorize/operate capability. A substantial collection of 
routines for sparse matrix calculation is also available from IMSL [5] as the Yale 
Sparse Matrix Package [6]. 

Figure 2.7.1. Some standard forms for sparse matrices. (a) Band diagonal; (b) block triangular; (c) block 
tridiagonal; (d) singly bordered block diagonal; (e) doubly bordered block diagonal; (f) singly bordered 
block triangular; (g) bordered band-triangular; (h) and (i) singly and doubly bordered band diagonal; (j) 
and (k) other! (after Tewarson) [1]. 

You should be aware that the special order of interchanges and eliminations, 
prescribed by a sparse matrix method so as to minimize fill-ins and arithmetic 
operations, generally acts to decrease the method’s numerical stability as compared 
to, e.g., regular LU decomposition with pivoting. Scaling your problem so as to 
make its nonzero matrix elements have comparable magnitudes (if you can do it) 
will sometimes ameliorate this problem. 

In the remainder of this section, we present some concepts which are applicable 
to some general classes of sparse matrices, and which do not necessarily depend on 
details of the pattern of sparsity. 

Sherman-Morrison Formula 

Suppose that you have already obtained, by herculean effort, the inverse matrix 
$\mathbf{A}^{-1}$ of a square matrix A. Now you want to make a “small” change in A, for 
example change one element $a_{ij}$, or a few elements, or one row, or one column. 
Is there any way of calculating the corresponding change in $\mathbf{A}^{-1}$ without repeating 
your difficult labors? Yes, if your change is of the form 

$$\mathbf{A} \rightarrow (\mathbf{A} + \mathbf{u} \otimes \mathbf{v}) \qquad (2.7.1)$$

for some vectors u and v. If u is a unit vector $\mathbf{e}_i$, then (2.7.1) adds the components 
of v to the ith row. (Recall that $\mathbf{u} \otimes \mathbf{v}$ is a matrix whose i,jth element is the product 
of the ith component of u and the jth component of v.) If v is a unit vector $\mathbf{e}_j$, then 
(2.7.1) adds the components of u to the jth column. If both u and v are proportional 
to unit vectors $\mathbf{e}_i$ and $\mathbf{e}_j$ respectively, then a term is added only to the element $a_{ij}$. 

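The Sherman-Morrison formula gives the inverse $(\mathbf{A} + \mathbf{u} \otimes \mathbf{v})^{-1}$, and is derived 
briefly as follows: 

$$
\begin{aligned}
(\mathbf{A} + \mathbf{u} \otimes \mathbf{v})^{-1} &= (\mathbf{1} + \mathbf{A}^{-1} \cdot \mathbf{u} \otimes \mathbf{v})^{-1} \cdot \mathbf{A}^{-1} \\
&= (\mathbf{1} - \mathbf{A}^{-1} \cdot \mathbf{u} \otimes \mathbf{v} + \mathbf{A}^{-1} \cdot \mathbf{u} \otimes \mathbf{v} \cdot \mathbf{A}^{-1} \cdot \mathbf{u} \otimes \mathbf{v} - \ldots) \cdot \mathbf{A}^{-1} \\
&= \mathbf{A}^{-1} - \mathbf{A}^{-1} \cdot \mathbf{u} \otimes \mathbf{v} \cdot \mathbf{A}^{-1} \, (1 - \lambda + \lambda^2 - \ldots) \\
&= \mathbf{A}^{-1} - \frac{(\mathbf{A}^{-1} \cdot \mathbf{u}) \otimes (\mathbf{v} \cdot \mathbf{A}^{-1})}{1 + \lambda}
\end{aligned}
\qquad (2.7.2)
$$

where 

$$\lambda \equiv \mathbf{v} \cdot \mathbf{A}^{-1} \cdot \mathbf{u} \qquad (2.7.3)$$

The second line of (2.7.2) is a formal power series expansion. In the third line, the 
associativity of outer and inner products is used to factor out the scalars $\lambda$. 

The use of (2.7.2) is this: Given $\mathbf{A}^{-1}$ and the vectors u and v, we need only 
perform two matrix multiplications and a vector dot product, 

$$\mathbf{z} \equiv \mathbf{A}^{-1} \cdot \mathbf{u} \qquad \mathbf{w} \equiv (\mathbf{A}^{-1})^T \cdot \mathbf{v} \qquad \lambda = \mathbf{v} \cdot \mathbf{z} \qquad (2.7.4)$$

to get the desired change in the inverse 

$$\mathbf{A}^{-1} \rightarrow \mathbf{A}^{-1} - \frac{\mathbf{z} \otimes \mathbf{w}}{1 + \lambda} \qquad (2.7.5)$$

The whole procedure requires only $3N^2$ multiplies and a like number of adds (an 
even smaller number if u or v is a unit vector). 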

The Sherman-Morrison formula can be directly applied to a class of sparse 
problems. If you already have a fast way of calculating the inverse of A (e.g., a 
tridiagonal matrix, or some other standard sparse form), then (2.7.4)-(2.7.5) allow 
you to build up to your related but more complicated form, adding for example a 
row or column at a time. Notice that you can apply the Sherman-Morrison formula 
more than once successively, using at each stage the most recent update of $\mathbf{A}^{-1}$ 
(equation 2.7.5). Of course, if you have to modify every row, then you are back to 
an $N^3$ method. The constant in front of the $N^3$ is only a few times worse than the 
better direct methods, but you have deprived yourself of the stabilizing advantages 
of pivoting — so be careful. 
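
As an illustration of equations (2.7.4) and (2.7.5), here is a minimal sketch of the in-place update of a stored inverse. The function name, the 0-based row-major double storage, and the caller-supplied scratch vectors are all assumptions of this sketch rather than anything defined in the text.

```c
#include <assert.h>

/* Updates ainv (n-by-n, row-major, 0-based) in place from A^{-1} to
   (A + u (x) v)^{-1}, using z = A^{-1}.u, w = (A^{-1})^T.v, lambda = v.z.
   z and w are caller-supplied scratch vectors of length n.
   Returns 0 on success, -1 if 1 + lambda == 0 (the updated matrix is singular). */
int sherman_morrison_update(double *ainv, int n, const double *u,
                            const double *v, double *z, double *w)
{
    int i, j;
    double lambda = 0.0, denom;

    assert(n > 0);
    for (i = 0; i < n; i++) {            /* the two matrix-vector products */
        z[i] = 0.0;
        w[i] = 0.0;
        for (j = 0; j < n; j++) {
            z[i] += ainv[i * n + j] * u[j];
            w[i] += ainv[j * n + i] * v[j];
        }
    }
    for (i = 0; i < n; i++) lambda += v[i] * z[i];
    denom = 1.0 + lambda;
    if (denom == 0.0) return -1;
    for (i = 0; i < n; i++)              /* subtract z (x) w / (1 + lambda) */
        for (j = 0; j < n; j++)
            ainv[i * n + j] -= z[i] * w[j] / denom;
    return 0;
}
```

For example, starting from the 2 x 2 identity with u = (1, 0) and v = (0, 1), the update adds 1 to the upper right element of A, and the stored inverse becomes the matrix with rows (1, -1) and (0, 1).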

For some other sparse problems, the Sherman-Morrison formula cannot be 
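directly applied for the simple reason that storage of the whole inverse matrix $\mathbf{A}^{-1}$ 
is not feasible. If you want to add only a single correction of the form $\mathbf{u} \otimes \mathbf{v}$, 
and solve the linear system 

$$(\mathbf{A} + \mathbf{u} \otimes \mathbf{v}) \cdot \mathbf{x} = \mathbf{b} \qquad (2.7.6)$$

then you proceed as follows. Using the fast method that is presumed available for 
the matrix A, solve the two auxiliary problems 

$$\mathbf{A} \cdot \mathbf{y} = \mathbf{b} \qquad \mathbf{A} \cdot \mathbf{z} = \mathbf{u} \qquad (2.7.7)$$

for the vectors y and z. In terms of these, 

$$\mathbf{x} = \mathbf{y} - \left[ \frac{\mathbf{v} \cdot \mathbf{y}}{1 + (\mathbf{v} \cdot \mathbf{z})} \right] \mathbf{z} \qquad (2.7.8)$$

as we see by multiplying (2.7.2) on the right by b. 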
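
The final combination step is short enough to sketch directly. The function below assumes that the two auxiliary solutions y and z have already been obtained by whatever fast method is available for A; its name and its 0-based double arrays are hypothetical choices for this illustration.

```c
#include <assert.h>

/* Combines the auxiliary solutions y (A.y = b) and z (A.z = u) into the
   solution x of (A + u (x) v).x = b, as x = y - [v.y / (1 + v.z)] z.
   Returns 0 on success, -1 if 1 + v.z vanishes. */
int sherman_morrison_combine(const double *y, const double *z,
                             const double *v, int n, double *x)
{
    double vy = 0.0, vz = 0.0, fact;
    int i;

    assert(n > 0);
    for (i = 0; i < n; i++) {
        vy += v[i] * y[i];
        vz += v[i] * z[i];
    }
    if (1.0 + vz == 0.0) return -1;
    fact = vy / (1.0 + vz);
    for (i = 0; i < n; i++) x[i] = y[i] - fact * z[i];
    return 0;
}
```

With A the 2 x 2 identity, u = (1, 0), v = (0, 1), and b = (3, 2), the auxiliary solutions are simply y = b and z = u, and the routine returns x = (1, 2), which indeed satisfies the corrected system.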

Cyclic Tridiagonal Systems 



So-called cyclic tridiagonal systems occur quite frequently, and are a good 
example of how to use the Sherman-Morrison formula in the manner just described. 
The equations have the form 

$$
\begin{bmatrix}
b_1 & c_1 & 0 & \cdots & & & \beta \\
a_2 & b_2 & c_2 & \cdots & & & \\
 & & & \cdots & & & \\
 & & & \cdots & a_{N-1} & b_{N-1} & c_{N-1} \\
\alpha & & & \cdots & 0 & a_N & b_N
\end{bmatrix}
\cdot
\begin{bmatrix} x_1 \\ x_2 \\ \cdots \\ x_{N-1} \\ x_N \end{bmatrix}
=
\begin{bmatrix} r_1 \\ r_2 \\ \cdots \\ r_{N-1} \\ r_N \end{bmatrix}
\qquad (2.7.9)
$$

This is a tridiagonal system, except for the matrix elements $\alpha$ and $\beta$ in the corners. 
Forms like this are typically generated by finite-differencing differential equations 
with periodic boundary conditions (§19.4). 

We use the Sherman-Morrison formula, treating the system as tridiagonal plus 
a correction. In the notation of equation (2.7.6), define vectors u and v to be 

$$
\mathbf{u} = \begin{bmatrix} \gamma \\ 0 \\ \vdots \\ 0 \\ \alpha \end{bmatrix}
\qquad
\mathbf{v} = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \\ \beta/\gamma \end{bmatrix}
\qquad (2.7.10)
$$

Here $\gamma$ is arbitrary for the moment. Then the matrix A is the tridiagonal part of the 
matrix in (2.7.9), with two terms modified: 

$$b_1' = b_1 - \gamma, \qquad b_N' = b_N - \alpha\beta/\gamma \qquad (2.7.11)$$

We now solve equations (2.7.7) with the standard tridiagonal algorithm, and then 
get the solution from equation (2.7.8). 

The routine cyclic below implements this algorithm. We choose the arbitrary 
parameter $\gamma = -b_1$ to avoid loss of precision by subtraction in the first of equations 
(2.7.11). In the unlikely event that this causes loss of precision in the second of 
these equations, you can make a different choice. 


#include "nrutil.h"

void cyclic(float a[], float b[], float c[], float alpha, float beta,
    float r[], float x[], unsigned long n)
/* Solves for a vector x[1..n] the "cyclic" set of linear equations given by equation (2.7.9).
   a, b, c, and r are input vectors, all dimensioned as [1..n], while alpha and beta are the
   corner entries in the matrix. The input is not modified. */
{
    void tridag(float a[], float b[], float c[], float r[], float u[],
        unsigned long n);
    unsigned long i;
    float fact,gamma,*bb,*u,*z;

    if (n <= 2) nrerror("n too small in cyclic");
    bb=vector(1,n);
    u=vector(1,n);
    z=vector(1,n);
    gamma = -b[1];                       /* Avoid subtraction error in forming bb[1]. */
    bb[1]=b[1]-gamma;                    /* Set up the diagonal of the modified tridiagonal system. */
    bb[n]=b[n]-alpha*beta/gamma;
    for (i=2;i<n;i++) bb[i]=b[i];
    tridag(a,bb,c,r,x,n);                /* Solve A.x = r. */
    u[1]=gamma;                          /* Set up the vector u. */
    u[n]=alpha;
    for (i=2;i<n;i++) u[i]=0.0;
    tridag(a,bb,c,u,z,n);                /* Solve A.z = u. */
    fact=(x[1]+beta*x[n]/gamma)/         /* Form v.x/(1 + v.z). */
        (1.0+z[1]+beta*z[n]/gamma);
    for (i=1;i<=n;i++) x[i] -= fact*z[i]; /* Now get the solution vector x. */
    free_vector(z,1,n);
    free_vector(u,1,n);
    free_vector(bb,1,n);
}
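
For readers who want to try the method without nrutil.h or tridag, here is a self-contained sketch of the same algorithm. It is an assumption-laden variant, not the book's routine: the names tridiag_solve and cyclic_solve are hypothetical, arrays are 0-based doubles, and errors are reported through return codes instead of nrerror. As in cyclic, alpha is the lower-left corner entry and beta the upper-right one, and the choice $\gamma = -b_1$ requires a nonzero first diagonal element.

```c
#include <assert.h>
#include <stdlib.h>

/* Tridiagonal solve without pivoting, 0-based: a = subdiagonal,
   b = diagonal, c = superdiagonal; a[0] and c[n-1] are not referenced.
   Returns 0 on success, -1 on a zero pivot or failed allocation. */
static int tridiag_solve(const double *a, const double *b, const double *c,
                         const double *r, double *x, int n)
{
    double *work = malloc((size_t)n * sizeof *work);
    double bet;
    int j, status = 0;

    if (work == NULL) return -1;
    bet = b[0];
    if (bet == 0.0) status = -1;
    else x[0] = r[0] / bet;
    for (j = 1; j < n && status == 0; j++) {   /* decomposition, forward substitution */
        work[j] = c[j - 1] / bet;
        bet = b[j] - a[j] * work[j];
        if (bet == 0.0) status = -1;
        else x[j] = (r[j] - a[j] * x[j - 1]) / bet;
    }
    if (status == 0)
        for (j = n - 2; j >= 0; j--) x[j] -= work[j + 1] * x[j + 1];  /* backsubstitution */
    free(work);
    return status;
}

/* Solves the cyclic tridiagonal system (2.7.9) by the Sherman-Morrison
   approach of (2.7.7), (2.7.8), (2.7.10), and (2.7.11).
   Requires n > 2 and b[0] != 0. Returns 0 on success, -1 on failure. */
int cyclic_solve(const double *a, const double *b, const double *c,
                 double alpha, double beta, const double *r, double *x, int n)
{
    double *bb, *u, *z, gam, fact;
    int i, status = -1;

    assert(n > 2 && b[0] != 0.0);
    bb = malloc((size_t)n * sizeof *bb);
    u = malloc((size_t)n * sizeof *u);
    z = malloc((size_t)n * sizeof *z);
    if (bb != NULL && u != NULL && z != NULL) {
        gam = -b[0];                           /* the arbitrary parameter gamma */
        for (i = 0; i < n; i++) { bb[i] = b[i]; u[i] = 0.0; }
        bb[0] = b[0] - gam;                    /* modified diagonal, (2.7.11) */
        bb[n - 1] = b[n - 1] - alpha * beta / gam;
        u[0] = gam;                            /* the vector u of (2.7.10) */
        u[n - 1] = alpha;
        if (tridiag_solve(a, bb, c, r, x, n) == 0 &&
            tridiag_solve(a, bb, c, u, z, n) == 0) {
            fact = (x[0] + beta * x[n - 1] / gam) /
                   (1.0 + z[0] + beta * z[n - 1] / gam);
            for (i = 0; i < n; i++) x[i] -= fact * z[i];
            status = 0;
        }
    }
    free(bb);
    free(u);
    free(z);
    return status;
}
```

A quick check is to pick a known x, form the right-hand side by multiplying it by the cyclic matrix, and confirm that cyclic_solve recovers x to within roundoff.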


Woodbury Formula 

If you want to add more than a single correction term, then you cannot use (2.7.8) 
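repeatedly, since without storing a new $\mathbf{A}^{-1}$ you will not be able to solve the auxiliary 
problems (2.7.7) efficiently after the first step. Instead, you need the Woodbury formula, which 
is the block-matrix version of the Sherman-Morrison formula, 

$$
(\mathbf{A} + \mathbf{U} \cdot \mathbf{V}^T)^{-1} = \mathbf{A}^{-1} - \left[ \mathbf{A}^{-1} \cdot \mathbf{U} \cdot (\mathbf{1} + \mathbf{V}^T \cdot \mathbf{A}^{-1} \cdot \mathbf{U})^{-1} \cdot \mathbf{V}^T \cdot \mathbf{A}^{-1} \right] \qquad (2.7.12)
$$

Here A is, as usual, an N × N matrix, while U and V are N × P matrices with P < N 
and usually P ≪ N. The inner piece of the correction term may become clearer if written 
as the tableau, 

$$
\underbrace{\mathbf{U}}_{N \times P} \cdot \underbrace{\left[ \mathbf{1} + \mathbf{V}^T \cdot \mathbf{A}^{-1} \cdot \mathbf{U} \right]^{-1}}_{P \times P} \cdot \underbrace{\mathbf{V}^T}_{P \times N} \qquad (2.7.13)
$$

where you can see that the matrix whose inverse is needed is only P × P rather than N × N. 

The relation between the Woodbury formula and successive applications of the Sherman-Morrison 
formula is now clarified by noting that, if U is the matrix formed by columns out of the 
P vectors $\mathbf{u}_1, \ldots, \mathbf{u}_P$, and V is the matrix formed by columns out of the P vectors $\mathbf{v}_1, \ldots, \mathbf{v}_P$, 

$$
\mathbf{U} \equiv \begin{bmatrix} \mathbf{u}_1 & \cdots & \mathbf{u}_P \end{bmatrix} \qquad \mathbf{V} \equiv \begin{bmatrix} \mathbf{v}_1 & \cdots & \mathbf{v}_P \end{bmatrix} \qquad (2.7.14)
$$

then two ways of expressing the same correction to A are 

$$
\left( \mathbf{A} + \sum_{k=1}^{P} \mathbf{u}_k \otimes \mathbf{v}_k \right) = (\mathbf{A} + \mathbf{U} \cdot \mathbf{V}^T) \qquad (2.7.15)
$$

(Note that the subscripts on u and v do not denote components, but rather distinguish the 
different column vectors.) 

Equation (2.7.15) reveals that, if you have $\mathbf{A}^{-1}$ in storage, then you can either make the 
P corrections in one fell swoop by using (2.7.12), inverting a P × P matrix, or else make 
them by applying (2.7.5) P successive times. 

If you don’t have storage for $\mathbf{A}^{-1}$, then you must use (2.7.12) in the following way: 
To solve the linear equation 

$$
\left( \mathbf{A} + \sum_{k=1}^{P} \mathbf{u}_k \otimes \mathbf{v}_k \right) \cdot \mathbf{x} = \mathbf{b} \qquad (2.7.16)
$$

first solve the P auxiliary problems 

$$
\mathbf{A} \cdot \mathbf{z}_1 = \mathbf{u}_1, \quad \mathbf{A} \cdot \mathbf{z}_2 = \mathbf{u}_2, \quad \ldots, \quad \mathbf{A} \cdot \mathbf{z}_P = \mathbf{u}_P \qquad (2.7.17)
$$

and construct the matrix Z by columns from the z’s obtained, 

$$
\mathbf{Z} \equiv \begin{bmatrix} \mathbf{z}_1 & \cdots & \mathbf{z}_P \end{bmatrix} \qquad (2.7.18)
$$

Next, do the P × P matrix inversion 

$$
\mathbf{H} \equiv (\mathbf{1} + \mathbf{V}^T \cdot \mathbf{Z})^{-1} \qquad (2.7.19)
$$

Finally, solve the one further auxiliary problem 

$$\mathbf{A} \cdot \mathbf{y} = \mathbf{b} \qquad (2.7.20)$$

In terms of these quantities, the solution is given by 

$$\mathbf{x} = \mathbf{y} - \mathbf{Z} \cdot \left[ \mathbf{H} \cdot (\mathbf{V}^T \cdot \mathbf{y}) \right] \qquad (2.7.21)$$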
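
The last step, equation (2.7.21), can be sketched as follows, assuming that y and the columns of Z have already been computed with the fast solver for A. Note one deliberate substitution: instead of forming the inverse H explicitly as in (2.7.19), this sketch solves the equivalent P × P linear system $(\mathbf{1} + \mathbf{V}^T \cdot \mathbf{Z}) \cdot \mathbf{t} = \mathbf{V}^T \cdot \mathbf{y}$ by Gaussian elimination with partial pivoting and then sets $\mathbf{x} = \mathbf{y} - \mathbf{Z} \cdot \mathbf{t}$. The function name and the 0-based row-major storage of Z and V (n rows, p columns) are assumptions of the illustration.

```c
#include <assert.h>
#include <math.h>
#include <stdlib.h>

/* Solves (A + U.V^T).x = b given y (A.y = b) and Z (A.z_k = u_k, stored
   as the columns of the n-by-p row-major array zmat). vmat is n-by-p,
   row-major. Returns 0 on success, -1 on failure. */
int woodbury_combine(const double *zmat, const double *vmat, const double *y,
                     int n, int p, double *x)
{
    double *m;                 /* augmented p-by-(p+1) system [1 + V^T.Z | V^T.y] */
    int i, j, k, q, status = 0;

    assert(n > 0 && p > 0);
    q = p + 1;
    m = malloc((size_t)p * (size_t)q * sizeof *m);
    if (m == NULL) return -1;
    for (i = 0; i < p; i++) {
        for (j = 0; j < p; j++) {
            double s = (i == j) ? 1.0 : 0.0;
            for (k = 0; k < n; k++) s += vmat[k * p + i] * zmat[k * p + j];
            m[i * q + j] = s;
        }
        m[i * q + p] = 0.0;
        for (k = 0; k < n; k++) m[i * q + p] += vmat[k * p + i] * y[k];
    }
    for (j = 0; j < p; j++) {                  /* elimination with partial pivoting */
        int piv = j;
        for (i = j + 1; i < p; i++)
            if (fabs(m[i * q + j]) > fabs(m[piv * q + j])) piv = i;
        if (m[piv * q + j] == 0.0) { status = -1; break; }
        for (k = 0; k <= p; k++) {             /* swap rows j and piv */
            double t = m[j * q + k];
            m[j * q + k] = m[piv * q + k];
            m[piv * q + k] = t;
        }
        for (i = j + 1; i < p; i++) {
            double f = m[i * q + j] / m[j * q + j];
            for (k = j; k <= p; k++) m[i * q + k] -= f * m[j * q + k];
        }
    }
    if (status == 0) {
        for (i = p - 1; i >= 0; i--) {         /* back substitution; t overwrites last column */
            double s = m[i * q + p];
            for (k = i + 1; k < p; k++) s -= m[i * q + k] * m[k * q + p];
            m[i * q + p] = s / m[i * q + i];
        }
        for (k = 0; k < n; k++) {              /* x = y - Z.t */
            double s = y[k];
            for (j = 0; j < p; j++) s -= zmat[k * p + j] * m[j * q + p];
            x[k] = s;
        }
    }
    free(m);
    return status;
}
```

Solving the small system rather than inverting it gives the same x as (2.7.21) and avoids storing H; if many right-hand sides share the same Z and V, forming H once as in the text may be the better choice.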

Inversion by Partitioning 

Once in a while, you will encounter a matrix (not even necessarily sparse) 
that can be inverted efficiently by partitioning. Suppose that the N x N matrix 
A is partitioned into 


A = 


P Q 
R S 


(2.7.22) 


where P and S are square matrices of size p x p and s x s respectively (p + s = N). 
The matrices Q and R are not necessarily square, and have sizes p x s and s x p, 
respectively. 

If the inverse of A is partitioned in the same manner. 


A - 1 


p 9 

R S 


(2.7.23) 


then P, Q, R, S, which have the same sizes as P, Q, R, S, respectively, can be 
found by either the formulas 

P= (P Q S 1 R) 1 

Q = _(p - Q S 1 R)- 1 • (Q • S” 1 ) 

(2.7.24) 

R = -(S" 1 R) (P - Q S" 1 R)- 1 
S = S -1 + (S -1 R) (P - Q S 1 R)- 1 • (Q • S" 1 ) 


or else by the equivalent formulas 

P = P 1 + (P _1 Q) (S - R P 1 Q)" 1 • (R • P _1 ) 
Q = (P 1 Q) (S R P 1 Q)" 1 
R = -(S R P 1 Q) 1 (R P 1 ) 

S = (S - R - P _1 • Q) -1 



The parentheses in equations (2.7.24) and (2.7.25) highlight repeated factors that 
you may wish to compute only once. (Of course, by associativity, you can instead 
do the matrix multiplications in any order you like.) The choice between using 
equation (2.7.24) and (2.7.25) depends on whether you want P or S to have the 
simpler formula; or on whether the repeated expression (S — R • P • Q) -1 is easier 
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to calculate than the expression (P — Q S -1 • R) _1 ; or on the relative sizes of P 
and S; or on whether P 1 or S _1 is already known. 

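Another sometimes useful formula is for the determinant of the partitioned
matrix,

$$\det \mathbf{A} = \det \mathbf{P} \, \det(\mathbf{S} - \mathbf{R} \cdot \mathbf{P}^{-1} \cdot \mathbf{Q}) = \det \mathbf{S} \, \det(\mathbf{P} - \mathbf{Q} \cdot \mathbf{S}^{-1} \cdot \mathbf{R}) \tag{2.7.26}$$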
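A quick way to check the signs in (2.7.25) and (2.7.26) is the smallest possible case, p = s = 1, where every block is a scalar and the result must agree with the familiar 2 x 2 inverse. The following is only a sketch of that check; the function names inv2_by_partition and det2_by_partition are invented for the illustration.

```c
#include <assert.h>
#include <math.h>

/* Invert the 2 x 2 matrix [[p, q], [r, s]] by treating each entry as a
   1 x 1 block and applying equations (2.7.25). Returns 0 on success,
   -1 if p or the expression S - R.P^-1.Q vanishes. */
int inv2_by_partition(double p, double q, double r, double s,
                      double inv[2][2])
{
    double pinv, pq, rp, schur, sinv;

    if (p == 0.0) return -1;
    pinv = 1.0 / p;
    pq = pinv * q;                 /* repeated factor (P^-1 . Q) */
    rp = r * pinv;                 /* repeated factor (R . P^-1) */
    schur = s - r * pinv * q;      /* S - R . P^-1 . Q */
    if (schur == 0.0) return -1;
    sinv = 1.0 / schur;
    inv[0][0] = pinv + pq * sinv * rp;   /* P-tilde */
    inv[0][1] = -pq * sinv;              /* Q-tilde */
    inv[1][0] = -sinv * rp;              /* R-tilde */
    inv[1][1] = sinv;                    /* S-tilde */
    return 0;
}

/* Determinant from (2.7.26): det A = det P * det(S - R.P^-1.Q). */
double det2_by_partition(double p, double q, double r, double s)
{
    assert(p != 0.0);
    return p * (s - r * q / p);
}
```

For the matrix with rows (4, 1) and (2, 3) this reproduces the inverse with rows (0.3, -0.1) and (-0.2, 0.4), and a determinant of 10.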


Indexed Storage of Sparse Matrices 

We have already seen (§2.4) that tri- or band-diagonal matrices can be stored in a compact 
format that allocates storage only to elements which can be nonzero, plus perhaps a few wasted 
locations to make the bookkeeping easier. What about more general sparse matrices? When a 
sparse matrix of dimension $N \times N$ contains only a few times $N$ nonzero elements (a typical
case), it is surely inefficient, and often physically impossible, to allocate storage for all
$N^2$ elements. Even if one did allocate such storage, it would be inefficient or prohibitive in
machine time to loop over all of it in search of nonzero elements. 

Obviously some kind of indexed storage scheme is required, one that stores only nonzero 
matrix elements, along with sufficient auxiliary information to determine where an element 
logically belongs and how the various elements can be looped over in common matrix 
operations. Unfortunately, there is no one standard scheme in general use. Knuth [7] describes 
one method. The Yale Sparse Matrix Package [6] and ITPACK [8] describe several other 
methods. For most applications, we favor the storage scheme used by PCGPAK [9], which
is almost the same as that described by Bentley [10], and also similar to one of the Yale Sparse 
Matrix Package methods. The advantage of this scheme, which can be called row-indexed 
sparse storage mode, is that it requires storage of only about two times the number of nonzero 
matrix elements. (Other methods can require as much as three or five times.) For simplicity, 
we will treat only the case of square matrices, which occurs most frequently in practice. 

To represent a matrix $\mathbf{A}$ of dimension $N \times N$, the row-indexed scheme sets up two
one-dimensional arrays, call them sa and ija. The first of these stores matrix element values
in single or double precision as desired; the second stores integer values. The storage rules are:

• The first N locations of sa store A’s diagonal matrix elements, in order. (Note that 
diagonal elements are stored even if they are zero; this is at most a slight storage 
inefficiency, since diagonal elements are nonzero in most realistic applications.) 

• Each of the first N locations of ija stores the index of the array sa that contains 
the first off-diagonal element of the corresponding row of the matrix. (If there are 
no off-diagonal elements for that row, it is one greater than the index in sa of the 
most recently stored element of a previous row.) 

• Location 1 of ija is always equal to N + 2. (It can be read to determine N.) 

• Location N + 1 of ija is one greater than the index in sa of the last off-diagonal 
element of the last row. (It can be read to determine the number of nonzero 
elements in the matrix, or the number of elements in the arrays sa and ija.) 
Location N + 1 of sa is not used and can be set arbitrarily. 

• Entries in sa at locations ≥ N + 2 contain A’s off-diagonal values, ordered by
rows and, within each row, ordered by columns.

• Entries in ija at locations ≥ N + 2 contain the column number of the corresponding
element in sa.

While these rules seem arbitrary at first sight, they result in a rather elegant storage 
scheme. As an example, consider the matrix 


$$\begin{bmatrix}
3. & 0. & 1. & 0. & 0. \\
0. & 4. & 0. & 0. & 0. \\
0. & 7. & 5. & 9. & 0. \\
0. & 0. & 0. & 0. & 2. \\
0. & 0. & 0. & 6. & 5.
\end{bmatrix} \tag{2.7.27}$$

In row-indexed compact storage, matrix (2.7.27) is represented by the two arrays of length
11, as follows

index k    1    2    3    4    5    6    7    8    9    10   11
ija[k]     7    8    8    10   11   12   3    2    4    5    4
sa[k]      3.   4.   5.   0.   5.   x    1.   7.   9.   2.   6.


Here x is an arbitrary value. Notice that, according to the storage rules, the value of N
(namely 5) is ija[1]-2, and the length of each array is ija[ija[1]-1]-1, namely 11.
The diagonal element in row i is sa[i], and the off-diagonal elements in that row are in
sa[k] where k loops from ija[i] to ija[i+1]-1, if the upper limit is greater than or equal to
the lower one (as in C’s for loops).
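To make the lookup rule concrete, here is a small sketch of a routine that reads a single element out of row-indexed storage. The name sprs_get is hypothetical (the book defines no such routine), and double is used where the text's routines use float. Arrays are unit-offset as in the text, so element 0 of each C array is unused padding.

```c
#include <assert.h>

/* Return element A[i][j] (1-based i and j) of a matrix held in
   row-indexed sparse storage arrays sa[1..] and ija[1..]. */
double sprs_get(const double sa[], const unsigned long ija[],
                unsigned long i, unsigned long j)
{
    unsigned long k, n = ija[1] - 2;    /* N is recoverable from ija[1] */

    assert(i >= 1 && i <= n && j >= 1 && j <= n);
    if (i == j) return sa[i];           /* diagonal lives in sa[1..N] */
    for (k = ija[i]; k < ija[i + 1]; k++)
        if (ija[k] == j) return sa[k];  /* a stored off-diagonal element */
    return 0.0;                         /* not stored, hence zero */
}
```

With the arrays tabulated above for matrix (2.7.27), sprs_get(sa, ija, 3, 4) finds 9 at k = 9, while row 2 has an empty off-diagonal range (ija[2] = ija[3] = 8), so every off-diagonal lookup in that row returns zero without entering the loop.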

Here is a routine, sprsin, that converts a matrix from full storage mode into row-indexed 
sparse storage mode, throwing away any elements that are less than a specified threshold. 
Of course, the principal use of sparse storage mode is for matrices whose full storage mode 
won’t fit into your machine at all; then you have to generate them directly into sparse format. 
Nevertheless sprsin is useful as a precise algorithmic definition of the storage scheme, for 
subscale testing of large problems, and for the case where execution time, rather than storage, 
furnishes the impetus to sparse storage. 


    #include <math.h>

    /* Converts a square matrix a[1..n][1..n] into row-indexed sparse storage
       mode. Only elements of a with magnitude >= thresh are retained. Output
       is in two linear arrays with dimension nmax (an input parameter):
       sa[1..] contains array values, indexed by ija[1..]. The number of
       elements filled of sa and ija on output are both ija[ija[1]-1]-1
       (see text). */
    void sprsin(float **a, int n, float thresh, unsigned long nmax, float sa[],
        unsigned long ija[])
    {
        void nrerror(char error_text[]);
        int i,j;
        unsigned long k;

        for (j=1;j<=n;j++) sa[j]=a[j][j];   /* Store diagonal elements. */
        ija[1]=n+2;       /* Index to 1st row off-diagonal element, if any. */
        k=n+1;
        for (i=1;i<=n;i++) {                /* Loop over rows. */
            for (j=1;j<=n;j++) {            /* Loop over columns. */
                if (fabs(a[i][j]) >= thresh && i != j) {
                    if (++k > nmax) nrerror("sprsin: nmax too small");
                    sa[k]=a[i][j];  /* Store off-diagonal elements and their columns. */
                    ija[k]=j;
                }
            }
            ija[i+1]=k+1;   /* As each row is completed, store index to next. */
        }
    }


The single most important use of a matrix in row-indexed sparse storage mode is to 
multiply a vector to its right. In fact, the storage mode is optimized for just this purpose. 
The following routine is thus very simple. 

    /* Multiply a matrix in row-index sparse storage arrays sa and ija by a
       vector x[1..n], giving a vector b[1..n]. */
    void sprsax(float sa[], unsigned long ija[], float x[], float b[],
        unsigned long n)
    {
        void nrerror(char error_text[]);
        unsigned long i,k;

        if (ija[1] != n+2) nrerror("sprsax: mismatched vector and matrix");
        for (i=1;i<=n;i++) {
            b[i]=sa[i]*x[i];                    /* Start with diagonal term. */
            for (k=ija[i];k<=ija[i+1]-1;k++)    /* Loop over off-diagonal terms. */
                b[i] += sa[k]*x[ija[k]];
        }
    }


It is also simple to multiply the transpose of a matrix by a vector to its right. (We will use 
this operation later in this section.) Note that the transpose matrix is not actually constructed. 

    /* Multiply the transpose of a matrix in row-index sparse storage arrays
       sa and ija by a vector x[1..n], giving a vector b[1..n]. */
    void sprstx(float sa[], unsigned long ija[], float x[], float b[],
        unsigned long n)
    {
        void nrerror(char error_text[]);
        unsigned long i,j,k;

        if (ija[1] != n+2) nrerror("mismatched vector and matrix in sprstx");
        for (i=1;i<=n;i++) b[i]=sa[i]*x[i];     /* Start with diagonal terms. */
        for (i=1;i<=n;i++) {                    /* Loop over off-diagonal terms. */
            for (k=ija[i];k<=ija[i+1]-1;k++) {
                j=ija[k];
                b[j] += sa[k]*x[i];
            }
        }
    }


(Double precision versions of sprsax and sprstx, named dsprsax and dsprstx, are used 
by the routine atimes later in this section. You can easily make the conversion, or else get 
the converted routines from the Numerical Recipes diskettes.) 
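The conversion amounts to changing float to double in the argument lists. The sketch below is one way to write it so that it compiles on its own; it is not the diskette version. In particular, assert stands in for the nrerror check so that no external error routine is needed.

```c
#include <assert.h>

/* Double precision counterpart of sprsax: b = A.x, unit-offset arrays. */
void dsprsax(double sa[], unsigned long ija[], double x[], double b[],
             unsigned long n)
{
    unsigned long i, k;

    assert(ija[1] == n + 2);        /* replaces the nrerror size check */
    for (i = 1; i <= n; i++) {
        b[i] = sa[i] * x[i];        /* diagonal term */
        for (k = ija[i]; k <= ija[i + 1] - 1; k++)
            b[i] += sa[k] * x[ija[k]];
    }
}

/* Double precision counterpart of sprstx: b = A^T.x, unit-offset arrays. */
void dsprstx(double sa[], unsigned long ija[], double x[], double b[],
             unsigned long n)
{
    unsigned long i, j, k;

    assert(ija[1] == n + 2);
    for (i = 1; i <= n; i++) b[i] = sa[i] * x[i];
    for (i = 1; i <= n; i++)
        for (k = ija[i]; k <= ija[i + 1] - 1; k++) {
            j = ija[k];             /* column of the stored element */
            b[j] += sa[k] * x[i];   /* scatter into row j of the result */
        }
}
```

Applied to matrix (2.7.27) with x = (1, 2, 3, 4, 5), dsprsax gives (6, 8, 65, 10, 49) and dsprstx gives (3, 29, 16, 57, 33). Note the different access patterns: the forward product gathers from x, while the transpose product scatters into b.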

In fact, because the choice of row-indexed storage treats rows and columns quite 
differently, it is quite an involved operation to construct the transpose of a matrix, given the 
matrix itself in row-indexed sparse storage mode. When the operation cannot be avoided, it 
is done as follows: An index of all off-diagonal elements by their columns is constructed 
(see §8.4). The elements are then written to the output array in column order. As each 
element is written, its row is determined and stored. Finally, the elements in each column 
are sorted by row. 

    /* Construct the transpose of a sparse square matrix, from row-index sparse
       storage arrays sa and ija into arrays sb and ijb. */
    void sprstp(float sa[], unsigned long ija[], float sb[], unsigned long ijb[])
    {
        void iindexx(unsigned long n, long arr[], unsigned long indx[]);
        /* Version of indexx with all float variables changed to long. */
        unsigned long j,jl,jm,jp,ju,k,m,n2,noff,inc,iv;
        float v;

        n2=ija[1];                          /* Linear size of matrix plus 2. */
        for (j=1;j<=n2-2;j++) sb[j]=sa[j];  /* Diagonal elements. */
        iindexx(ija[n2-1]-ija[1],(long *)&ija[n2-1],&ijb[n2-1]);
        /* Index all off-diagonal elements by their columns. */
        jp=0;
        for (k=ija[1];k<=ija[n2-1]-1;k++) { /* Loop over output off-diagonal elements. */
            m=ijb[k]+n2-1;      /* Use index table to store by (former) columns. */
            sb[k]=sa[m];
            for (j=jp+1;j<=ija[m];j++) ijb[j]=k;
            /* Fill in the index to any omitted rows. */
            jp=ija[m];          /* Use bisection to find which row element */
            jl=1;               /* m is in and put that into ijb[k]. */
            ju=n2-1;
            while (ju-jl > 1) {
                jm=(ju+jl)/2;
                if (ija[jm] > m) ju=jm; else jl=jm;
            }
            ijb[k]=jl;
        }
        for (j=jp+1;j<n2;j++) ijb[j]=ija[n2-1];
        for (j=1;j<=n2-2;j++) {     /* Make a final pass to sort each row by */
            jl=ijb[j+1]-ijb[j];     /* Shell sort algorithm. */
            noff=ijb[j]-1;
            inc=1;
            do {
                inc *= 3;
                inc++;
            } while (inc <= jl);
            do {
                inc /= 3;
                for (k=noff+inc+1;k<=noff+jl;k++) {
                    iv=ijb[k];
                    v=sb[k];
                    m=k;
                    while (ijb[m-inc] > iv) {
                        ijb[m]=ijb[m-inc];
                        sb[m]=sb[m-inc];
                        m -= inc;
                        if (m-noff <= inc) break;
                    }
                    ijb[m]=iv;
                    sb[m]=v;
                }
            } while (inc > 1);
        }
    }


The above routine embeds internally a sorting algorithm from §8.1, but calls the external 
routine iindexx to construct the initial column index. This routine is identical to indexx, 
as listed in §8.4, except that the latter’s two float declarations should be changed to long. 
(The Numerical Recipes diskettes include both indexx and iindexx.) In fact, you can 
often use indexx without making these changes, since many computers have the property 
that numerical values will sort correctly independently of whether they are interpreted as 
floating or integer values. 


As final examples of the manipulation of sparse matrices, we give two routines for the 
multiplication of two sparse matrices. These are useful for techniques to be described in §13.10.

In general, the product of two sparse matrices is not itself sparse. One therefore wants 
to limit the size of the product matrix in one of two ways: either compute only those elements 
of the product that are specified in advance by a known pattern of sparsity, or else compute all 
nonzero elements, but store only those whose magnitude exceeds some threshold value. The 
former technique, when it can be used, is quite efficient. The pattern of sparsity is specified 
by furnishing an index array in row-index sparse storage format (e.g., ija). The program 
then constructs a corresponding value array (e.g., sa). The latter technique runs the danger of 
excessive compute times and unknown output sizes, so it must be used cautiously. 

With row-index storage, it is much more natural to multiply a matrix (on the left) by
the transpose of a matrix (on the right), so that one is crunching rows on rows, rather than
rows on columns. Our routines therefore calculate $\mathbf{A} \cdot \mathbf{B}^T$, rather than $\mathbf{A} \cdot \mathbf{B}$. This means
that you have to run your right-hand matrix through the transpose routine sprstp before
sending it to the matrix multiply routine.
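The "rows on rows" idea is that element (i, j) of the product of A with the transpose of B is the dot product of row i of A with row j of B, and two rows with sorted column indices can be dotted in a single merging pass. The sketch below shows that pass in isolation. It assumes each row is handed over as plain parallel arrays of columns and values with the diagonal already merged in, which is simpler than the row-indexed layout, where the diagonal is stored apart; the name sparse_row_dot is invented here.

```c
#include <assert.h>

/* Dot product of two sparse rows. Each row is given as parallel arrays of
   strictly increasing column indices (ca, cb) and values (va, vb), with
   na and nb stored entries respectively. */
double sparse_row_dot(const unsigned long ca[], const double va[],
                      unsigned long na,
                      const unsigned long cb[], const double vb[],
                      unsigned long nb)
{
    unsigned long ia = 0, ib = 0;
    double sum = 0.0;

    assert((na == 0 || (ca && va)) && (nb == 0 || (cb && vb)));
    while (ia < na && ib < nb) {
        if (ca[ia] < cb[ib]) ia++;         /* column present only in row a */
        else if (ca[ia] > cb[ib]) ib++;    /* column present only in row b */
        else sum += va[ia++] * vb[ib++];   /* columns match: accumulate */
    }
    return sum;
}
```

Much of the convoluted logic in the book's multiply routines comes from doing this same merge while the diagonal element of each row sits outside the sorted off-diagonal list.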
The two implementing routines, sprspm for “pattern multiply” and sprstm for “threshold
multiply” are quite similar in structure. Both are complicated by the logic of the various
combinations of diagonal or off-diagonal elements for the two input streams and output stream.


    /* Matrix multiply A . B^T where A and B are two sparse matrices in row-index
       storage mode, and B^T is the transpose of B. Here, sa and ija store the
       matrix A; sb and ijb store the matrix B. This routine computes only those
       components of the matrix product that are pre-specified by the input
       index array ijc, which is not modified. On output, the arrays sc and ijc
       give the product matrix in row-index storage mode. For sparse matrix
       multiplication, this routine will often be preceded by a call to sprstp,
       so as to construct the transpose of a known matrix into sb, ijb. */
    void sprspm(float sa[], unsigned long ija[], float sb[], unsigned long ijb[],
        float sc[], unsigned long ijc[])
    {
        void nrerror(char error_text[]);
        unsigned long i,ijma,ijmb,j,m,ma,mb,mbb,mn;
        float sum;

        if (ija[1] != ijb[1] || ija[1] != ijc[1])
            nrerror("sprspm: sizes do not match");
        for (i=1;i<=ijc[1]-2;i++) {         /* Loop over rows. */
            j=m=i;          /* Set up so that first pass through loop does the */
            mn=ijc[i];      /* diagonal component. */
            sum=sa[i]*sb[i];
            for (;;) {      /* Main loop over each component to be output. */
                mb=ijb[j];
                for (ma=ija[i];ma<=ija[i+1]-1;ma++) {
                    /* Loop through elements in A's row. Convoluted logic,
                       following, accounts for the various combinations of
                       diagonal and off-diagonal elements. */
                    ijma=ija[ma];
                    if (ijma == j) sum += sa[ma]*sb[j];
                    else {
                        while (mb < ijb[j+1]) {
                            ijmb=ijb[mb];
                            if (ijmb == i) {
                                sum += sa[i]*sb[mb++];
                                continue;
                            } else if (ijmb < ijma) {
                                mb++;
                                continue;
                            } else if (ijmb == ijma) {
                                sum += sa[ma]*sb[mb++];
                                continue;
                            }
                            break;
                        }
                    }
                }
                for (mbb=mb;mbb<=ijb[j+1]-1;mbb++) {
                    /* Exhaust the remainder of B's row. */
                    if (ijb[mbb] == i) sum += sa[i]*sb[mbb];
                }
                sc[m]=sum;
                sum=0.0;    /* Reset indices for next pass through loop. */
                if (mn >= ijc[i+1]) break;
                j=ijc[m=mn++];
            }
        }
    }



    #include <math.h>

    /* Matrix multiply A . B^T where A and B are two sparse matrices in row-index
       storage mode, and B^T is the transpose of B. Here, sa and ija store the
       matrix A; sb and ijb store the matrix B. This routine computes all
       components of the matrix product (which may be non-sparse!), but stores
       only those whose magnitude exceeds thresh. On output, the arrays sc and
       ijc (whose maximum size is input as nmax) give the product matrix in
       row-index storage mode. For sparse matrix multiplication, this routine
       will often be preceded by a call to sprstp, so as to construct the
       transpose of a known matrix into sb, ijb. */
    void sprstm(float sa[], unsigned long ija[], float sb[], unsigned long ijb[],
        float thresh, unsigned long nmax, float sc[], unsigned long ijc[])
    {
        void nrerror(char error_text[]);
        unsigned long i,ijma,ijmb,j,k,ma,mb,mbb;
        float sum;

        if (ija[1] != ijb[1]) nrerror("sprstm: sizes do not match");
        ijc[1]=k=ija[1];
        for (i=1;i<=ija[1]-2;i++) {         /* Loop over rows of A, */
            for (j=1;j<=ijb[1]-2;j++) {     /* and rows of B. */
                if (i == j) sum=sa[i]*sb[j]; else sum=0.0e0;
                mb=ijb[j];
                for (ma=ija[i];ma<=ija[i+1]-1;ma++) {
                    /* Loop through elements in A's row. Convoluted logic,
                       following, accounts for the various combinations of
                       diagonal and off-diagonal elements. */
                    ijma=ija[ma];
                    if (ijma == j) sum += sa[ma]*sb[j];
                    else {
                        while (mb < ijb[j+1]) {
                            ijmb=ijb[mb];
                            if (ijmb == i) {
                                sum += sa[i]*sb[mb++];
                                continue;
                            } else if (ijmb < ijma) {
                                mb++;
                                continue;
                            } else if (ijmb == ijma) {
                                sum += sa[ma]*sb[mb++];
                                continue;
                            }
                            break;
                        }
                    }
                }
                for (mbb=mb;mbb<=ijb[j+1]-1;mbb++) {
                    /* Exhaust the remainder of B's row. */
                    if (ijb[mbb] == i) sum += sa[i]*sb[mbb];
                }
                if (i == j) sc[i]=sum;      /* Where to put the answer... */
                else if (fabs(sum) > thresh) {
                    if (k > nmax) nrerror("sprstm: nmax too small");
                    sc[k]=sum;
                    ijc[k++]=j;
                }
            }
            ijc[i+1]=k;
        }
    }


Conjugate Gradient Method for a Sparse System 

So-called conjugate gradient methods provide a quite general means for solving the
$N \times N$ linear system

$$\mathbf{A} \cdot \mathbf{x} = \mathbf{b} \tag{2.7.29}$$

The attractiveness of these methods for large sparse systems is that they reference $\mathbf{A}$ only
through its multiplication of a vector, or the multiplication of its transpose and a vector. As
we have seen, these operations can be very efficient for a properly stored sparse matrix. You,
the “owner” of the matrix $\mathbf{A}$, can be asked to provide functions that perform these sparse
matrix multiplications as efficiently as possible. We, the “grand strategists,” supply the general
routine, linbcg below, that solves the set of linear equations, (2.7.29), using your functions.

The simplest, “ordinary” conjugate gradient algorithm [11-13] solves (2.7.29) only in the
case that $\mathbf{A}$ is symmetric and positive definite. It is based on the idea of minimizing the function

$$f(\mathbf{x}) = \tfrac{1}{2}\, \mathbf{x} \cdot \mathbf{A} \cdot \mathbf{x} - \mathbf{b} \cdot \mathbf{x} \tag{2.7.30}$$

This function is minimized when its gradient

$$\nabla f = \mathbf{A} \cdot \mathbf{x} - \mathbf{b} \tag{2.7.31}$$

is zero, which is equivalent to (2.7.29). The minimization is carried out by generating a
succession of search directions $\mathbf{p}_k$ and improved minimizers $\mathbf{x}_k$. At each stage a quantity $\alpha_k$
is found that minimizes $f(\mathbf{x}_k + \alpha_k \mathbf{p}_k)$, and $\mathbf{x}_{k+1}$ is set equal to the new point $\mathbf{x}_k + \alpha_k \mathbf{p}_k$.
The $\mathbf{p}_k$ and $\mathbf{x}_k$ are built up in such a way that $\mathbf{x}_{k+1}$ is also the minimizer of $f$ over the whole
vector space of directions already taken, $\{\mathbf{p}_1, \mathbf{p}_2, \ldots, \mathbf{p}_k\}$. After $N$ iterations you arrive at
the minimizer over the entire vector space, i.e., the solution to (2.7.29).

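Later, in §10.6, we will generalize this “ordinary” conjugate gradient algorithm to the
minimization of arbitrary nonlinear functions. Here, where our interest is in solving linear,
but not necessarily positive definite or symmetric, equations, a different generalization is
important, the biconjugate gradient method. This method does not, in general, have a simple
connection with function minimization. It constructs four sequences of vectors, $\mathbf{r}_k$, $\bar{\mathbf{r}}_k$, $\mathbf{p}_k$,
$\bar{\mathbf{p}}_k$, $k = 1, 2, \ldots$. You supply the initial vectors $\mathbf{r}_1$ and $\bar{\mathbf{r}}_1$, and set $\mathbf{p}_1 = \mathbf{r}_1$, $\bar{\mathbf{p}}_1 = \bar{\mathbf{r}}_1$. Then
you carry out the following recurrence:

$$\begin{aligned}
\alpha_k &= \frac{\bar{\mathbf{r}}_k \cdot \mathbf{r}_k}{\bar{\mathbf{p}}_k \cdot \mathbf{A} \cdot \mathbf{p}_k} \\
\mathbf{r}_{k+1} &= \mathbf{r}_k - \alpha_k \mathbf{A} \cdot \mathbf{p}_k \\
\bar{\mathbf{r}}_{k+1} &= \bar{\mathbf{r}}_k - \alpha_k \mathbf{A}^T \cdot \bar{\mathbf{p}}_k \\
\beta_k &= \frac{\bar{\mathbf{r}}_{k+1} \cdot \mathbf{r}_{k+1}}{\bar{\mathbf{r}}_k \cdot \mathbf{r}_k} \\
\mathbf{p}_{k+1} &= \mathbf{r}_{k+1} + \beta_k \mathbf{p}_k \\
\bar{\mathbf{p}}_{k+1} &= \bar{\mathbf{r}}_{k+1} + \beta_k \bar{\mathbf{p}}_k
\end{aligned} \tag{2.7.32}$$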


This sequence of vectors satisfies the biorthogonality condition

$$\bar{\mathbf{r}}_i \cdot \mathbf{r}_j = \mathbf{r}_i \cdot \bar{\mathbf{r}}_j = 0, \qquad j < i \tag{2.7.33}$$

and the biconjugacy condition

$$\bar{\mathbf{p}}_i \cdot \mathbf{A} \cdot \mathbf{p}_j = \mathbf{p}_i \cdot \mathbf{A}^T \cdot \bar{\mathbf{p}}_j = 0, \qquad j < i \tag{2.7.34}$$

There is also a mutual orthogonality,

$$\bar{\mathbf{r}}_i \cdot \mathbf{p}_j = \mathbf{r}_i \cdot \bar{\mathbf{p}}_j = 0, \qquad j < i \tag{2.7.35}$$

The proof of these properties proceeds by straightforward induction [14]. As long as the
recurrence does not break down earlier because one of the denominators is zero, it must
terminate after $m \le N$ steps with $\mathbf{r}_{m+1} = \bar{\mathbf{r}}_{m+1} = 0$. This is basically because after at most
$N$ steps you run out of new orthogonal directions to the vectors you’ve already constructed.

To use the algorithm to solve the system (2.7.29), make an initial guess $\mathbf{x}_1$ for the
solution. Choose $\mathbf{r}_1$ to be the residual

$$\mathbf{r}_1 = \mathbf{b} - \mathbf{A} \cdot \mathbf{x}_1 \tag{2.7.36}$$

and choose $\bar{\mathbf{r}}_1 = \mathbf{r}_1$. Then form the sequence of improved estimates

$$\mathbf{x}_{k+1} = \mathbf{x}_k + \alpha_k \mathbf{p}_k \tag{2.7.37}$$

while carrying out the recurrence (2.7.32). Equation (2.7.37) guarantees that $\mathbf{r}_{k+1}$ from the
recurrence is in fact the residual $\mathbf{b} - \mathbf{A} \cdot \mathbf{x}_{k+1}$ corresponding to $\mathbf{x}_{k+1}$. Since $\mathbf{r}_{m+1} = 0$,
$\mathbf{x}_{m+1}$ is the solution to equation (2.7.29).

While there is no guarantee that this whole procedure will not break down or become 
unstable for general A, in practice this is rare. More importantly, the exact termination in at 
most N iterations occurs only with exact arithmetic. Roundoff error means that you should 
regard the process as a genuinely iterative procedure, to be halted when some appropriate 
error criterion is met. 

The ordinary conjugate gradient algorithm is the special case of the biconjugate gradient
algorithm when $\mathbf{A}$ is symmetric, and we choose $\bar{\mathbf{r}}_1 = \mathbf{r}_1$. Then $\bar{\mathbf{r}}_k = \mathbf{r}_k$ and $\bar{\mathbf{p}}_k = \mathbf{p}_k$ for all
$k$; you can omit computing them and halve the work of the algorithm. This conjugate gradient
version has the interpretation of minimizing equation (2.7.30). If $\mathbf{A}$ is positive definite as
well as symmetric, the algorithm cannot break down (in theory!). The routine linbcg below
indeed reduces to the ordinary conjugate gradient method if you input a symmetric $\mathbf{A}$, but
it does all the redundant computations.

Another variant of the general algorithm corresponds to a symmetric but non-positive
definite $\mathbf{A}$, with the choice $\bar{\mathbf{r}}_1 = \mathbf{A} \cdot \mathbf{r}_1$ instead of $\bar{\mathbf{r}}_1 = \mathbf{r}_1$. In this case $\bar{\mathbf{r}}_k = \mathbf{A} \cdot \mathbf{r}_k$ and
$\bar{\mathbf{p}}_k = \mathbf{A} \cdot \mathbf{p}_k$ for all $k$. This algorithm is thus equivalent to the ordinary conjugate gradient
algorithm, but with all dot products $\mathbf{a} \cdot \mathbf{b}$ replaced by $\mathbf{a} \cdot \mathbf{A} \cdot \mathbf{b}$. It is called the minimum residual
algorithm, because it corresponds to successive minimizations of the function

$$\Phi(\mathbf{x}) = \tfrac{1}{2}\, \mathbf{r} \cdot \mathbf{r} = \tfrac{1}{2}\, |\mathbf{A} \cdot \mathbf{x} - \mathbf{b}|^2 \tag{2.7.38}$$

where the successive iterates $\mathbf{x}_k$ minimize $\Phi$ over the same set of search directions $\mathbf{p}_k$ generated
in the conjugate gradient method. This algorithm has been generalized in various ways for
unsymmetric matrices. The generalized minimum residual method (GMRES; see [9,15]) is
probably the most robust of these methods.

Note that equation (2.7.38) gives

$$\nabla \Phi(\mathbf{x}) = \mathbf{A}^T \cdot (\mathbf{A} \cdot \mathbf{x} - \mathbf{b}) \tag{2.7.39}$$

For any nonsingular matrix $\mathbf{A}$, $\mathbf{A}^T \cdot \mathbf{A}$ is symmetric and positive definite. You might therefore
be tempted to solve equation (2.7.29) by applying the ordinary conjugate gradient algorithm
to the problem

$$(\mathbf{A}^T \cdot \mathbf{A}) \cdot \mathbf{x} = \mathbf{A}^T \cdot \mathbf{b} \tag{2.7.40}$$

Don’t! The condition number of the matrix $\mathbf{A}^T \cdot \mathbf{A}$ is the square of the condition number of
$\mathbf{A}$ (see §2.6 for definition of condition number). A large condition number both increases the
number of iterations required, and limits the accuracy to which a solution can be obtained. It
is almost always better to apply the biconjugate gradient method to the original matrix $\mathbf{A}$.
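A two-by-two diagonal matrix is enough to see the squaring. The symbol $\kappa$ for the condition number (ratio of largest to smallest singular value) is notation introduced only for this example.

```latex
\mathbf{A} = \begin{bmatrix} 1 & 0 \\ 0 & 10^{-3} \end{bmatrix},
\qquad \kappa(\mathbf{A}) = \frac{1}{10^{-3}} = 10^{3};
\qquad
\mathbf{A}^T \cdot \mathbf{A} = \begin{bmatrix} 1 & 0 \\ 0 & 10^{-6} \end{bmatrix},
\qquad \kappa(\mathbf{A}^T \cdot \mathbf{A}) = 10^{6} = \kappa(\mathbf{A})^{2}.
```

In single precision, with roughly seven significant digits, a condition number of order $10^6$ leaves very little accuracy in the solution, whereas $10^3$ is still comfortable.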

So far we have said nothing about the rate of convergence of these methods. The
ordinary conjugate gradient method works well for matrices that are well-conditioned, i.e.,
“close” to the identity matrix. This suggests applying these methods to the preconditioned
form of equation (2.7.29),

$$(\tilde{\mathbf{A}}^{-1} \cdot \mathbf{A}) \cdot \mathbf{x} = \tilde{\mathbf{A}}^{-1} \cdot \mathbf{b} \tag{2.7.41}$$

The idea is that you might already be able to solve your linear system easily for some $\tilde{\mathbf{A}}$ close
to $\mathbf{A}$, in which case $\tilde{\mathbf{A}}^{-1} \cdot \mathbf{A} \approx \mathbf{1}$, allowing the algorithm to converge in fewer steps. The
matrix $\tilde{\mathbf{A}}$ is called a preconditioner [11], and the overall scheme given here is known as the
preconditioned biconjugate gradient method or PBCG.

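For efficient implementation, the PBCG algorithm introduces an additional set of vectors
$\mathbf{z}_k$ and $\bar{\mathbf{z}}_k$ defined by

$$\tilde{\mathbf{A}} \cdot \mathbf{z}_k = \mathbf{r}_k \qquad \text{and} \qquad \tilde{\mathbf{A}}^T \cdot \bar{\mathbf{z}}_k = \bar{\mathbf{r}}_k \tag{2.7.42}$$

and modifies the definitions of $\alpha_k$, $\beta_k$, $\mathbf{p}_k$, and $\bar{\mathbf{p}}_k$ in equation (2.7.32):

$$\begin{aligned}
\alpha_k &= \frac{\bar{\mathbf{r}}_k \cdot \mathbf{z}_k}{\bar{\mathbf{p}}_k \cdot \mathbf{A} \cdot \mathbf{p}_k} \\
\beta_k &= \frac{\bar{\mathbf{r}}_{k+1} \cdot \mathbf{z}_{k+1}}{\bar{\mathbf{r}}_k \cdot \mathbf{z}_k} \\
\mathbf{p}_{k+1} &= \mathbf{z}_{k+1} + \beta_k \mathbf{p}_k \\
\bar{\mathbf{p}}_{k+1} &= \bar{\mathbf{z}}_{k+1} + \beta_k \bar{\mathbf{p}}_k
\end{aligned} \tag{2.7.43}$$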


For linbcg, below, we will ask you to supply routines that solve the auxiliary linear systems
(2.7.42). If you have no idea what to use for the preconditioner $\tilde{\mathbf{A}}$, then use the diagonal part
of $\mathbf{A}$, or even the identity matrix, in which case the burden of convergence will be entirely
on the biconjugate gradient method itself.

The routine linbcg, below, is based on a program originally written by Anne Greenbaum. 
(See [13] for a different, less sophisticated, implementation.) There are a few wrinkles you 
should know about. 

What constitutes “good” convergence is rather application dependent. The routine
linbcg therefore provides for four possibilities, selected by setting the flag itol on input.
If itol=1, iteration stops when the quantity $|\mathbf{A} \cdot \mathbf{x} - \mathbf{b}| / |\mathbf{b}|$ is less than the input quantity
tol. If itol=2, the required criterion is

$$|\tilde{\mathbf{A}}^{-1} \cdot (\mathbf{A} \cdot \mathbf{x} - \mathbf{b})| \,/\, |\tilde{\mathbf{A}}^{-1} \cdot \mathbf{b}| < \text{tol} \tag{2.7.44}$$

If itol=3, the routine uses its own estimate of the error in $\mathbf{x}$, and requires its magnitude,
divided by the magnitude of $\mathbf{x}$, to be less than tol. The setting itol=4 is the same as itol=3,
except that the largest (in absolute value) component of the error and largest component of $\mathbf{x}$
are used instead of the vector magnitude (that is, the $L_\infty$ norm instead of the $L_2$ norm). You
may need to experiment to find which of these convergence criteria is best for your problem.

On output, err is the tolerance actually achieved. If the returned count iter does 
not indicate that the maximum number of allowed iterations itmax was exceeded, then err 
should be less than tol. If you want to do further iterations, leave all returned quantities as 
they are and call the routine again. The routine loses its memory of the spanned conjugate 
gradient subspace between calls, however, so you should not force it to return more often 
than about every N iterations. 

Finally, note that linbcg is furnished in double precision, since it will usually be
used when N is quite large.


    #include <stdio.h>
    #include <math.h>
    #include "nrutil.h"
    #define EPS 1.0e-14

    /* Solves A . x = b for x[1..n], given b[1..n], by the iterative biconjugate
       gradient method. On input x[1..n] should be set to an initial guess of
       the solution (or all zeros); itol is 1,2,3, or 4, specifying which
       convergence test is applied (see text); itmax is the maximum number of
       allowed iterations; and tol is the desired convergence tolerance. On
       output, x[1..n] is reset to the improved solution, iter is the number of
       iterations actually taken, and err is the estimated error. The matrix A
       is referenced only through the user-supplied routines atimes, which
       computes the product of either A or its transpose on a vector; and
       asolve, which solves Atilde . x = b or Atilde^T . x = b for some
       preconditioner matrix Atilde (possibly the trivial diagonal part of A). */
    void linbcg(unsigned long n, double b[], double x[], int itol, double tol,
        int itmax, int *iter, double *err)
    {
        void asolve(unsigned long n, double b[], double x[], int itrnsp);
        void atimes(unsigned long n, double x[], double r[], int itrnsp);
        double snrm(unsigned long n, double sx[], int itol);
        unsigned long j;
        double ak,akden,bk,bkden,bknum,bnrm,dxnrm,xnrm,zm1nrm,znrm;
        double *p,*pp,*r,*rr,*z,*zz;  /* Double precision is a good idea in this routine. */

        p=dvector(1,n);
        pp=dvector(1,n);
        r=dvector(1,n);
        rr=dvector(1,n);
        z=dvector(1,n);
        zz=dvector(1,n);

        /* Calculate initial residual. */
        *iter=0;
        atimes(n,x,r,0);    /* Input to atimes is x[1..n], output is r[1..n];
                               the final 0 indicates that the matrix (not its
                               transpose) is to be used. */
        for (j=1;j<=n;j++) {
            r[j]=b[j]-r[j];
            rr[j]=r[j];
        }
        /* atimes(n,r,rr,0); */     /* Uncomment this line to get the "minimum
                                       residual" variant of the algorithm. */
        if (itol == 1) {
            bnrm=snrm(n,b,itol);
            asolve(n,r,z,0);    /* Input to asolve is r[1..n], output is z[1..n];
                                   the final 0 indicates that the matrix Atilde
                                   (not its transpose) is to be used. */
        }
        else if (itol == 2) {
            asolve(n,b,z,0);
            bnrm=snrm(n,z,itol);
            asolve(n,r,z,0);
        }
        else if (itol == 3 || itol == 4) {
            asolve(n,b,z,0);
            bnrm=snrm(n,z,itol);
            asolve(n,r,z,0);
            znrm=snrm(n,z,itol);
        } else nrerror("illegal itol in linbcg");
        while (*iter <= itmax) {    /* Main loop. */
            ++(*iter);
            asolve(n,rr,zz,1);  /* Final 1 indicates use of transpose matrix Atilde^T. */
            for (bknum=0.0,j=1;j<=n;j++) bknum += z[j]*rr[j];
            /* Calculate coefficient bk and direction vectors p and pp. */
            if (*iter == 1) {
                for (j=1;j<=n;j++) {
                    p[j]=z[j];
                    pp[j]=zz[j];
                }
            }
            else {
                bk=bknum/bkden;
                for (j=1;j<=n;j++) {
                    p[j]=bk*p[j]+z[j];
                    pp[j]=bk*pp[j]+zz[j];
                }
            }
            bkden=bknum;        /* Calculate coefficient ak, new iterate x, and */
            atimes(n,p,z,0);    /* new residuals r and rr. */
            for (akden=0.0,j=1;j<=n;j++) akden += z[j]*pp[j];
            ak=bknum/akden;
            atimes(n,pp,zz,1);
            for (j=1;j<=n;j++) {
                x[j] += ak*p[j];
                r[j] -= ak*z[j];
                rr[j] -= ak*zz[j];
            }
            asolve(n,r,z,0);    /* Solve Atilde . z = r and check stopping criterion. */
            if (itol == 1)
                *err=snrm(n,r,itol)/bnrm;
            else if (itol == 2)
                *err=snrm(n,z,itol)/bnrm;
            else if (itol == 3 || itol == 4) {
                zm1nrm=znrm;
                znrm=snrm(n,z,itol);
                if (fabs(zm1nrm-znrm) > EPS*znrm) {
                    dxnrm=fabs(ak)*snrm(n,p,itol);
                    *err=znrm/fabs(zm1nrm-znrm)*dxnrm;
                } else {
                    *err=znrm/bnrm;     /* Error may not be accurate, so loop again. */
                    continue;
                }
                xnrm=snrm(n,x,itol);
                if (*err <= 0.5*xnrm) *err /= xnrm;
                else {
                    *err=znrm/bnrm;     /* Error may not be accurate, so loop again. */
                    continue;
                }
            }
            printf("iter=%4d err=%12.6f\n",*iter,*err);
            if (*err <= tol) break;
        }

        free_dvector(p,1,n);
        free_dvector(pp,1,n);
        free_dvector(r,1,n);
        free_dvector(rr,1,n);
        free_dvector(z,1,n);
        free_dvector(zz,1,n);
    }


The routine linbcg uses this short utility for computing vector norms: 

    #include <math.h>

    /* Compute one of two norms for a vector sx[1..n], as signaled by itol.
       Used by linbcg. */
    double snrm(unsigned long n, double sx[], int itol)
    {
        unsigned long i,isamax;
        double ans;

        if (itol <= 3) {
            ans = 0.0;
            for (i=1;i<=n;i++) ans += sx[i]*sx[i];  /* Vector magnitude norm. */
            return sqrt(ans);
        } else {
            isamax=1;
            for (i=1;i<=n;i++) {                    /* Largest component norm. */
                if (fabs(sx[i]) > fabs(sx[isamax])) isamax=i;
            }
            return fabs(sx[isamax]);
        }
    }


So that the specifications for the routines atimes and asolve are clear, we list here 
simple versions that assume a matrix A stored somewhere in row-index sparse format. 

    extern unsigned long ija[];
    extern double sa[];                 /* The matrix is stored somewhere. */

    void atimes(unsigned long n, double x[], double r[], int itrnsp)
    {
        void dsprsax(double sa[], unsigned long ija[], double x[], double b[],
            unsigned long n);
        void dsprstx(double sa[], unsigned long ija[], double x[], double b[],
            unsigned long n);
        /* These are double versions of sprsax and sprstx. */

        if (itrnsp) dsprstx(sa,ija,x,r,n);
        else dsprsax(sa,ija,x,r,n);
    }

    extern unsigned long ija[];
    extern double sa[];                 /* The matrix is stored somewhere. */

    void asolve(unsigned long n, double b[], double x[], int itrnsp)
    {
        unsigned long i;

        for (i=1;i<=n;i++) x[i]=(sa[i] != 0.0 ? b[i]/sa[i] : b[i]);
        /* The matrix Atilde is the diagonal part of A, stored in the first n
           elements of sa. Since the transpose matrix has the same diagonal,
           the flag itrnsp is not used. */
    }


CITED REFERENCES AND FURTHER READING: 

Tewarson, R.P. 1973, Sparse Matrices (New York: Academic Press). [1] 

Jacobs, D.A.H. (ed.) 1977, The State of the Art in Numerical Analysis (London: Academic Press), 
Chapter 1.3 (by J.K. Reid). [2] 

George, A., and Liu, J.W.H. 1981, Computer Solution of Large Sparse Positive Definite Systems 
(Englewood Cliffs, NJ: Prentice-Hall). [3] 

NAG Fortran Library (Numerical Algorithms Group, 256 Banbury Road, Oxford OX27DE, U.K.). 
[4] 

IMSL Math/Library Users Manual (\MSL Inc., 2500 CityWest Boulevard, Houston TX 77042). [5] 
Eisenstat, S.C., Gursky, M.C., Schultz, M.H., and Sherman, A.H. 1977, Yale Sparse Matrix Pack¬ 
age, Technical Reports 112 and 114 (Yale University Department of Computer Science). [6] 
Knuth, D.E. 1968, Fundamental Algorithms, vol. 1 of The Art of Computer Programming (Reading, 
MA: Addison-Wesley), §2.2.6. [7] 

Kincaid, D.R., Respess, J.R., Young, D.M., and Grimes, R.G. 1982, ACM Transactions on Math¬ 
ematical Software, vol. 8, pp. 302-322. [8] 

PCGPAK User’s Guide (New Haven: Scientific Computing Associates, Inc.). [9] 

Bentley, J. 1986, Programming Pearls (Reading, MA: Addison-Wesley), §9. [10] 

Golub, G.H., and Van Loan, C.F. 1989, Matrix Computations, 2nd ed. (Baltimore: Johns Hopkins 
University Press), Chapters 4 and 10, particularly §§10.2-10.3. [11] 

Stoer, J., and Bulirsch, R. 1980, Introduction to Numerical Analysis (New York: Springer-Verlag), 
Chapter 8. [12] 

Baker, L. 1991, More C Tools for Scientists and Engineers (New York: McGraw-Hill). [13]

Fletcher, R. 1976, in Numerical Analysis Dundee 1975, Lecture Notes in Mathematics, vol. 506,
A. Dold and B. Eckmann, eds. (Berlin: Springer-Verlag), pp. 73-89. [14]

Saad, Y., and Schulz, M. 1986, SIAM Journal on Scientific and Statistical Computing, vol. 7, 
pp. 856-869. [15] 

Bunch, J.R., and Rose, D.J. (eds.) 1976, Sparse Matrix Computations (New York: Academic 
Press). 

Duff, I.S., and Stewart, G.W. (eds.) 1979, Sparse Matrix Proceedings 1978 (Philadelphia: 
S.I.A.M.). 







2.8 Vandermonde Matrices and Toeplitz Matrices


In §2.4 the case of a tridiagonal matrix was treated specially, because that
particular type of linear system admits a solution in only of order $N$ operations,
rather than of order $N^3$ for the general linear problem. When such particular types
exist, it is important to know about them. Your computational savings, should you
ever happen to be working on a problem that involves the right kind of particular
type, can be enormous.

This section treats two special types of matrices that can be solved in of order
$N^2$ operations, not as good as tridiagonal, but a lot better than the general case.
(Other than the operations count, these two types have nothing in common.)
Matrices of the first type, termed Vandermonde matrices, occur in some problems
having to do with the fitting of polynomials, the reconstruction of distributions from
their moments, and also other contexts. In this book, for example, a Vandermonde
problem crops up in §3.5. Matrices of the second type, termed Toeplitz matrices,
tend to occur in problems involving deconvolution and signal processing. In this
book, a Toeplitz problem is encountered in §13.7.

These are not the only special types of matrices worth knowing about. The
Hilbert matrices, whose components are of the form $a_{ij} = 1/(i + j - 1)$, $i,j =
1,\ldots,N$, can be inverted by an exact integer algorithm, and are very difficult to
invert in any other way, since they are notoriously ill-conditioned (see [1] for details).
The Sherman-Morrison and Woodbury formulas, discussed in §2.7, can sometimes
be used to convert new special forms into old ones. Reference [2] gives some other
special forms. We have not found these additional forms to arise as frequently as
the two that we now discuss.


Vandermonde Matrices 


A Vandermonde matrix of size $N \times N$ is completely determined by $N$ arbitrary
numbers $x_1, x_2, \ldots, x_N$, in terms of which its $N^2$ components are the integer powers
$x_i^{j-1}$, $i,j = 1,\ldots,N$. Evidently there are two possible such forms, depending on whether
we view the $i$'s as rows, $j$'s as columns, or vice versa. In the former case, we get a linear
system of equations that looks like this:

$$
\begin{bmatrix}
1 & x_1 & x_1^2 & \cdots & x_1^{N-1} \\
1 & x_2 & x_2^2 & \cdots & x_2^{N-1} \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_N & x_N^2 & \cdots & x_N^{N-1}
\end{bmatrix}
\cdot
\begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_N \end{bmatrix}
=
\begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_N \end{bmatrix}
\tag{2.8.1}
$$


Performing the matrix multiplication, you will see that this equation solves for the unknown
coefficients $c_i$ which fit a polynomial to the $N$ pairs of abscissas and ordinates $(x_j, y_j)$.
Precisely this problem will arise in §3.5, and the routine given there will solve (2.8.1) by the
method that we are about to describe.

The alternative identification of rows and columns leads to the set of equations

$$
\begin{bmatrix}
1 & 1 & \cdots & 1 \\
x_1 & x_2 & \cdots & x_N \\
x_1^2 & x_2^2 & \cdots & x_N^2 \\
\vdots & \vdots & & \vdots \\
x_1^{N-1} & x_2^{N-1} & \cdots & x_N^{N-1}
\end{bmatrix}
\cdot
\begin{bmatrix} w_1 \\ w_2 \\ w_3 \\ \vdots \\ w_N \end{bmatrix}
=
\begin{bmatrix} q_1 \\ q_2 \\ q_3 \\ \vdots \\ q_N \end{bmatrix}
\tag{2.8.2}
$$

Write this out and you will see that it relates to the problem of moments: Given the values
of $N$ points $x_i$, find the unknown weights $w_i$, assigned so as to match the given values
$q_j$ of the first $N$ moments. (For more on this problem, consult [3].) The routine given in
this section solves (2.8.2).




The method of solution of both (2.8.1) and (2.8.2) is closely related to Lagrange's
polynomial interpolation formula, which we will not formally meet until §3.1 below.
Notwithstanding, the following derivation should be comprehensible:

Let $P_j(x)$ be the polynomial of degree $N - 1$ defined by

$$
P_j(x) = \prod_{\substack{n=1 \\ n \ne j}}^{N} \frac{x - x_n}{x_j - x_n} = \sum_{k=1}^{N} A_{jk}\, x^{k-1}
\tag{2.8.3}
$$

Here the meaning of the last equality is to define the components of the matrix $A_{ij}$ as the
coefficients that arise when the product is multiplied out and like terms collected.

The polynomial $P_j(x)$ is a function of $x$ generally. But you will notice that it is
specifically designed so that it takes on a value of zero at all $x_i$ with $i \ne j$, and has a value
of unity at $x = x_j$. In other words,

$$
P_j(x_i) = \delta_{ij} = \sum_{k=1}^{N} A_{jk}\, x_i^{k-1}
\tag{2.8.4}
$$


But (2.8.4) says that $A_{jk}$ is exactly the inverse of the matrix of components $x_i^{k-1}$, which
appears in (2.8.2), with the subscript as the column index. Therefore the solution of (2.8.2)
is just that matrix inverse times the right-hand side,

$$
w_j = \sum_{k=1}^{N} A_{jk}\, q_k
\tag{2.8.5}
$$

As for the transpose problem (2.8.1), we can use the fact that the inverse of the transpose
is the transpose of the inverse, so

$$
c_j = \sum_{k=1}^{N} A_{kj}\, y_k
\tag{2.8.6}
$$

The routine in §3.5 implements this.

It remains to find a good way of multiplying out the monomial terms in (2.8.3), in order
to get the components of $A_{jk}$. This is essentially a bookkeeping problem, and we will let you
read the routine itself to see how it can be solved. One trick is to define a master $P(x)$ by

$$
P(x) \equiv \prod_{n=1}^{N} (x - x_n)
\tag{2.8.7}
$$

work out its coefficients, and then obtain the numerators and denominators of the specific $P_j$'s
via synthetic division by the one supernumerary term. (See §5.3 for more on synthetic division.)
Since each such division is only a process of order $N$, the total procedure is of order $N^2$.
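The bookkeeping just described can be sketched in a few lines. The following is a minimal illustration, not the book's routine: the name vander_solve, the 0-based arrays, and the int return code are assumptions made for this example. It builds the coefficients of the master polynomial, then for each j divides synthetically by (x - x_j) while accumulating the numerator and denominator of w_j.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical helper (not the book's vander): solves
   sum_i x[i]^k * w[i] = q[k], k = 0..n-1, with 0-based arrays.
   Returns 0 on success, -1 if memory allocation fails. */
int vander_solve(const double x[], const double q[], double w[], int n)
{
    assert(n >= 1);
    /* c[k] is the coefficient of t^k in P(t) = prod_m (t - x[m]). */
    double *c = calloc((size_t)n + 1, sizeof *c);
    if (c == NULL) return -1;
    c[0] = 1.0;
    for (int m = 0; m < n; m++) {
        /* Multiply the current polynomial (degree m) by (t - x[m]). */
        for (int k = m + 1; k >= 1; k--) c[k] = c[k-1] - x[m] * c[k];
        c[0] = -x[m] * c[0];
    }
    for (int j = 0; j < n; j++) {
        /* Synthetic division of P(t) by (t - x[j]); the quotient
           coefficients b are produced from high degree to low. */
        double b = c[n];            /* leading quotient coefficient (= 1) */
        double s = q[n-1] * b;      /* numerator: sum_k b_k * q[k]        */
        double t = b;               /* denominator: quotient at x[j]      */
        for (int k = n - 1; k >= 1; k--) {
            b = c[k] + x[j] * b;    /* coefficient of t^(k-1)             */
            s += q[k-1] * b;
            t = x[j] * t + b;       /* Horner evaluation                  */
        }
        w[j] = s / t;
    }
    free(c);
    return 0;
}
```

For the points x = {1, 2, 3} and moments q = {1.5, 1.5, 2.5} this should recover the weights {2, -1, 0.5}. Each of the N divisions costs order N operations, which is where the overall N^2 count comes from.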

You should be warned that Vandermonde systems are notoriously ill-conditioned, by
their very nature. (As an aside anticipating §5.8, the reason is the same as that which makes
Chebyshev fitting so impressively accurate: there exist high-order polynomials that are very
good uniform fits to zero. Hence roundoff error can introduce rather substantial coefficients
of the leading terms of these polynomials.) It is a good idea always to compute Vandermonde
problems in double precision.

The routine for (2.8.2) which follows is due to G.B. Rybicki.


#include "nrutil.h"

void vander(double x[], double w[], double q[], int n)
/* Solves the Vandermonde linear system sum_{i=1}^{N} x_i^{k-1} w_i = q_k (k = 1,...,N). Input consists of
the vectors x[1..n] and q[1..n]; the vector w[1..n] is output. */
{
    int i,j,k;
    double b,s,t,xx;
    double *c;

    c=dvector(1,n);
    if (n == 1) w[1]=q[1];
    else {
        for (i=1;i<=n;i++) c[i]=0.0;        /* Initialize array. */
        c[n] = -x[1];                       /* Coefficients of the master polynomial are found */
        for (i=2;i<=n;i++) {                /* by recursion. */
            xx = -x[i];
            for (j=(n+1-i);j<=(n-1);j++) c[j] += xx*c[j+1];
            c[n] += xx;
        }
        for (i=1;i<=n;i++) {                /* Each subfactor in turn */
            xx=x[i];
            t=b=1.0;
            s=q[n];
            for (k=n;k>=2;k--) {            /* is synthetically divided, */
                b=c[k]+xx*b;
                s += q[k-1]*b;              /* matrix-multiplied by the right-hand side, */
                t=xx*t+b;
            }
            w[i]=s/t;                       /* and supplied with a denominator. */
        }
    }
    free_dvector(c,1,n);
}


Toeplitz Matrices 

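An $N \times N$ Toeplitz matrix is specified by giving $2N - 1$ numbers $R_k$, $k = -N +
1, \ldots, -1, 0, 1, \ldots, N - 1$. Those numbers are then emplaced as matrix elements constant
along the (upper-left to lower-right) diagonals of the matrix:

$$
\begin{bmatrix}
R_0 & R_{-1} & R_{-2} & \cdots & R_{-(N-2)} & R_{-(N-1)} \\
R_1 & R_0 & R_{-1} & \cdots & R_{-(N-3)} & R_{-(N-2)} \\
R_2 & R_1 & R_0 & \cdots & R_{-(N-4)} & R_{-(N-3)} \\
\vdots & & & & & \vdots \\
R_{N-2} & R_{N-3} & R_{N-4} & \cdots & R_0 & R_{-1} \\
R_{N-1} & R_{N-2} & R_{N-3} & \cdots & R_1 & R_0
\end{bmatrix}
\tag{2.8.8}
$$

The linear Toeplitz problem can thus be written as

$$
\sum_{j=1}^{N} R_{i-j}\, x_j = y_i \qquad (i = 1,\ldots,N)
\tag{2.8.9}
$$

where the $x_j$'s, $j = 1,\ldots,N$, are the unknowns to be solved for.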
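As a small illustration of (2.8.9), the product of a Toeplitz matrix with a vector can be formed directly from the 2N - 1 numbers, without ever storing the full matrix. The sketch below is an assumption-laden example: the helper name toeplitz_matvec and the 0-based layout (R_k stored at r[n-1+k]) are choices made here, not conventions taken from the text.

```c
#include <assert.h>

/* Hypothetical helper: y = T x for the n x n Toeplitz matrix with
   T[i][j] = R_{i-j}. The 2n-1 numbers are stored 0-based in r[], with
   R_k at r[n-1+k] for k = -(n-1), ..., n-1 (an assumed layout). */
void toeplitz_matvec(const double r[], const double x[], double y[], int n)
{
    assert(n >= 1);
    for (int i = 0; i < n; i++) {
        double sum = 0.0;
        for (int j = 0; j < n; j++) sum += r[n - 1 + i - j] * x[j];
        y[i] = sum;
    }
}
```

With r = {0.5, 1, 4, 2, 1} (that is, R_{-2} through R_2) and x = {1, 2, 3}, the result should be y = {7.5, 13, 17}. A product like this is a convenient check on any solver for the Toeplitz problem.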

The Toeplitz matrix is symmetric if $R_k = R_{-k}$ for all $k$. Levinson [4] developed an
algorithm for fast solution of the symmetric Toeplitz problem, by a bordering method, that is,
a recursive procedure that solves the $M$-dimensional Toeplitz problem

$$
\sum_{j=1}^{M} R_{i-j}\, x_j^{(M)} = y_i \qquad (i = 1,\ldots,M)
\tag{2.8.10}
$$

in turn for $M = 1, 2, \ldots$ until $M = N$, the desired result, is finally reached. The vector $x_j^{(M)}$
is the result at the $M$th stage, and becomes the desired answer only when $N$ is reached.

Levinson’s method is well documented in standard texts (e.g., [5]). The useful fact that 
the method generalizes to the nonsymmetric case seems to be less well known. At some risk 
of excessive detail, we therefore give a derivation here, due to G.B. Rybicki. 

In following a recursion from step $M$ to step $M + 1$ we find that our developing solution
$x^{(M)}$ changes in this way:

$$
\sum_{j=1}^{M} R_{i-j}\, x_j^{(M)} = y_i \qquad i = 1,\ldots,M
\tag{2.8.11}
$$

becomes

$$
\sum_{j=1}^{M} R_{i-j}\, x_j^{(M+1)} + R_{i-(M+1)}\, x_{M+1}^{(M+1)} = y_i \qquad i = 1,\ldots,M+1
\tag{2.8.12}
$$


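By eliminating $y_i$ we find

$$
\sum_{j=1}^{M} R_{i-j} \left( \frac{x_j^{(M)} - x_j^{(M+1)}}{x_{M+1}^{(M+1)}} \right) = R_{i-(M+1)} \qquad i = 1,\ldots,M
\tag{2.8.13}
$$

or by letting $i \to M + 1 - i$ and $j \to M + 1 - j$,

$$
\sum_{j=1}^{M} R_{j-i}\, G_j^{(M)} = R_{-i}
\tag{2.8.14}
$$

where

$$
G_j^{(M)} \equiv \frac{x_{M+1-j}^{(M)} - x_{M+1-j}^{(M+1)}}{x_{M+1}^{(M+1)}}
\tag{2.8.15}
$$

To put this another way,

$$
x_{M+1-j}^{(M+1)} = x_{M+1-j}^{(M)} - x_{M+1}^{(M+1)}\, G_j^{(M)} \qquad j = 1,\ldots,M
\tag{2.8.16}
$$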


Thus, if we can use recursion to find the order $M$ quantities $x^{(M)}$ and $G^{(M)}$ and the single
order $M + 1$ quantity $x_{M+1}^{(M+1)}$, then all of the other $x_j^{(M+1)}$ will follow. Fortunately, the
quantity $x_{M+1}^{(M+1)}$ follows from equation (2.8.12) with $i = M + 1$,

$$
\sum_{j=1}^{M} R_{M+1-j}\, x_j^{(M+1)} + R_0\, x_{M+1}^{(M+1)} = y_{M+1}
\tag{2.8.17}
$$

For the unknown order $M + 1$ quantities $x_j^{(M+1)}$ we can substitute the previous order
quantities in $G$ since

$$
G_{M+1-j}^{(M)} = \frac{x_j^{(M)} - x_j^{(M+1)}}{x_{M+1}^{(M+1)}}
\tag{2.8.18}
$$

The result of this operation is

$$
x_{M+1}^{(M+1)} = \frac{\sum_{j=1}^{M} R_{M+1-j}\, x_j^{(M)} - y_{M+1}}{\sum_{j=1}^{M} R_{M+1-j}\, G_{M+1-j}^{(M)} - R_0}
\tag{2.8.19}
$$

The only remaining problem is to develop a recursion relation for $G$. Before we do
that, however, we should point out that there are actually two distinct sets of solutions to the
original linear problem for a nonsymmetric matrix, namely right-hand solutions (which we
have been discussing) and left-hand solutions $z_i$. The formalism for the left-hand solutions
differs only in that we deal with the equations

$$
\sum_{j=1}^{M} R_{j-i}\, z_j^{(M)} = y_i \qquad i = 1,\ldots,M
\tag{2.8.20}
$$

Then, the same sequence of operations on this set leads to

$$
\sum_{j=1}^{M} R_{i-j}\, H_j^{(M)} = R_i
\tag{2.8.21}
$$

where

$$
H_j^{(M)} \equiv \frac{z_{M+1-j}^{(M)} - z_{M+1-j}^{(M+1)}}{z_{M+1}^{(M+1)}}
\tag{2.8.22}
$$

(compare with 2.8.14 - 2.8.15). The reason for mentioning the left-hand solutions now is
that, by equation (2.8.21), the $H_j$ satisfy exactly the same equation as the $x_j$ except for
the substitution $y_i \to R_i$ on the right-hand side. Therefore we can quickly deduce from
equation (2.8.19) that

$$
H_{M+1}^{(M+1)} = \frac{\sum_{j=1}^{M} R_{M+1-j}\, H_j^{(M)} - R_{M+1}}{\sum_{j=1}^{M} R_{M+1-j}\, G_{M+1-j}^{(M)} - R_0}
\tag{2.8.23}
$$

By the same token, $G$ satisfies the same equation as $z$, except for the substitution $y_i \to R_{-i}$.
This gives

$$
G_{M+1}^{(M+1)} = \frac{\sum_{j=1}^{M} R_{j-M-1}\, G_j^{(M)} - R_{-M-1}}{\sum_{j=1}^{M} R_{j-M-1}\, H_{M+1-j}^{(M)} - R_0}
\tag{2.8.24}
$$

The same "morphism" also turns equation (2.8.16), and its partner for $z$, into the final equations

$$
\begin{aligned}
G_j^{(M+1)} &= G_j^{(M)} - G_{M+1}^{(M+1)}\, H_{M+1-j}^{(M)} \\
H_j^{(M+1)} &= H_j^{(M)} - H_{M+1}^{(M+1)}\, G_{M+1-j}^{(M)}
\end{aligned}
\tag{2.8.25}
$$

Now, starting with the initial values

$$
x_1^{(1)} = y_1/R_0 \qquad G_1^{(1)} = R_{-1}/R_0 \qquad H_1^{(1)} = R_1/R_0
\tag{2.8.26}
$$

we can recurse away. At each stage $M$ we use equations (2.8.23) and (2.8.24) to find
$H_{M+1}^{(M+1)}$, $G_{M+1}^{(M+1)}$, and then equation (2.8.25) to find the other components of $H^{(M+1)}$, $G^{(M+1)}$.
From there the vectors $x^{(M+1)}$ and/or $z^{(M+1)}$ are easily calculated.

The program below does this. It incorporates the second equation in (2.8.25) in the form

$$
H_{M+1-j}^{(M+1)} = H_{M+1-j}^{(M)} - H_{M+1}^{(M+1)}\, G_j^{(M)}
\tag{2.8.27}
$$

so that the computation can be done "in place."

Notice that the above algorithm fails if $R_0 = 0$. In fact, because the bordering method
does not allow pivoting, the algorithm will fail if any of the diagonal principal minors of the
original Toeplitz matrix vanish. (Compare with discussion of the tridiagonal algorithm in
§2.4.) If the algorithm fails, your matrix is not necessarily singular; you might just have
to solve your problem by a slower and more general algorithm such as LU decomposition
with pivoting.

The routine that implements equations (2.8.23)-(2.8.27) is also due to Rybicki. Note
that the routine's r[n+j] is equal to $R_j$ above, so that subscripts on the r array vary from
1 to $2N - 1$.

#include "nrutil.h"
#define FREERETURN {free_vector(h,1,n);free_vector(g,1,n);return;}

void toeplz(float r[], float x[], float y[], int n)
/* Solves the Toeplitz system sum_{j=1}^{N} R_(N+i-j) x_j = y_i (i = 1,...,N). The Toeplitz matrix need
not be symmetric. y[1..n] and r[1..2*n-1] are input arrays; x[1..n] is the output array. */
{
    int j,k,m,m1,m2;
    float pp,pt1,pt2,qq,qt1,qt2,sd,sgd,sgn,shn,sxn;
    float *g,*h;

    if (r[n] == 0.0) nrerror("toeplz-1 singular principal minor");
    g=vector(1,n);
    h=vector(1,n);
    x[1]=y[1]/r[n];                         /* Initialize for the recursion. */
    if (n == 1) FREERETURN
    g[1]=r[n-1]/r[n];
    h[1]=r[n+1]/r[n];
    for (m=1;m<=n;m++) {                    /* Main loop over the recursion. */
        m1=m+1;
        sxn = -y[m1];                       /* Compute numerator and denominator for x, */
        sd = -r[n];
        for (j=1;j<=m;j++) {
            sxn += r[n+m1-j]*x[j];
            sd += r[n+m1-j]*g[m-j+1];
        }
        if (sd == 0.0) nrerror("toeplz-2 singular principal minor");
        x[m1]=sxn/sd;                       /* whence x. */
        for (j=1;j<=m;j++) x[j] -= x[m1]*g[m-j+1];
        if (m1 == n) FREERETURN
        sgn = -r[n-m1];                     /* Compute numerator and denominator for G and H, */
        shn = -r[n+m1];
        sgd = -r[n];
        for (j=1;j<=m;j++) {
            sgn += r[n+j-m1]*g[j];
            shn += r[n+m1-j]*h[j];
            sgd += r[n+j-m1]*h[m-j+1];
        }
        if (sgd == 0.0) nrerror("toeplz-3 singular principal minor");
        g[m1]=sgn/sgd;                      /* whence G and H. */
        h[m1]=shn/sd;
        k=m;
        m2=(m+1) >> 1;
        pp=g[m1];
        qq=h[m1];
        for (j=1;j<=m2;j++) {
            pt1=g[j];
            pt2=g[k];
            qt1=h[j];
            qt2=h[k];
            g[j]=pt1-pp*qt2;
            g[k]=pt2-pp*qt1;
            h[j]=qt1-qq*pt2;
            h[k--]=qt2-qq*pt1;
        }
    }                                       /* Back for another recurrence. */
    nrerror("toeplz - should not arrive here!");
}
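For readers who want to experiment without nrutil.h, the same recursion can be written as a standalone function. The version below is only a sketch under stated assumptions: the name toeplitz_solve, the use of double precision, and the int status code in place of nrerror are choices made here. It keeps the 1-based indexing of the routine above (element 0 of each array is unused), so that r[n+k] still holds R_k.

```c
#include <assert.h>
#include <stdlib.h>

/* Sketch only: the recursion of toeplz in double precision with an int
   status. Arrays are 1-based: r[1..2n-1] with r[n+k] = R_k, y[1..n],
   x[1..n]; element 0 is unused. Returns 0 on success, -1 on a vanishing
   principal minor, -2 if memory allocation fails. */
int toeplitz_solve(const double r[], double x[], const double y[], int n)
{
    assert(n >= 1);
    if (r[n] == 0.0) return -1;
    x[1] = y[1] / r[n];
    if (n == 1) return 0;
    double *g = malloc(((size_t)n + 1) * sizeof *g);
    double *h = malloc(((size_t)n + 1) * sizeof *h);
    int status = -2;
    if (g == NULL || h == NULL) goto done;
    status = -1;
    g[1] = r[n-1] / r[n];
    h[1] = r[n+1] / r[n];
    for (int m = 1; m < n; m++) {
        int m1 = m + 1;
        double sxn = -y[m1], sd = -r[n];
        for (int j = 1; j <= m; j++) {          /* equation (2.8.19) */
            sxn += r[n+m1-j] * x[j];
            sd  += r[n+m1-j] * g[m-j+1];
        }
        if (sd == 0.0) goto done;
        x[m1] = sxn / sd;
        for (int j = 1; j <= m; j++) x[j] -= x[m1] * g[m-j+1];
        if (m1 == n) { status = 0; goto done; }
        double sgn = -r[n-m1], shn = -r[n+m1], sgd = -r[n];
        for (int j = 1; j <= m; j++) {          /* (2.8.23) and (2.8.24) */
            sgn += r[n+j-m1] * g[j];
            shn += r[n+m1-j] * h[j];
            sgd += r[n+j-m1] * h[m-j+1];
        }
        if (sgd == 0.0) goto done;
        g[m1] = sgn / sgd;
        h[m1] = shn / sd;
        double pp = g[m1], qq = h[m1];
        /* In-place update (2.8.25), working inward from both ends. */
        for (int j = 1, k = m; j <= (m + 1) / 2; j++, k--) {
            double pt1 = g[j], pt2 = g[k], qt1 = h[j], qt2 = h[k];
            g[j] = pt1 - pp * qt2;
            g[k] = pt2 - pp * qt1;
            h[j] = qt1 - qq * pt2;
            h[k] = qt2 - qq * pt1;
        }
    }
done:
    free(g);
    free(h);
    return status;
}
```

A caller that receives -1 can fall back on a general method such as LU decomposition with pivoting, as the discussion of vanishing principal minors suggests. For the 3 x 3 case with R_{-2}, ..., R_2 = 0.5, 1, 4, 2, 1 and right-hand side {7.5, 13, 17}, the solution should be {1, 2, 3}.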



If you are in the business of solving very large Toeplitz systems, you should find out about
so-called "new, fast" algorithms, which require only on the order of $N (\log N)^2$ operations,
compared to $N^2$ for Levinson's method. These methods are too complicated to include here.
Papers by Bunch [6] and de Hoog [7] will give entry to the literature.


CITED REFERENCES AND FURTHER READING: 

Golub, G.H., and Van Loan, C.F. 1989, Matrix Computations, 2nd ed. (Baltimore: Johns Hopkins 
University Press), Chapter 5 [also treats some other special forms]. 

Forsythe, G.E., and Moler, C.B. 1967, Computer Solution of Linear Algebraic Systems
(Englewood Cliffs, NJ: Prentice-Hall), §19. [1]

Westlake, J.R. 1968, A Handbook of Numerical Matrix Inversion and Solution of Linear Equations 
(New York: Wiley). [2] 

von Mises, R. 1964, Mathematical Theory of Probability and Statistics (New York: Academic 
Press), pp. 394ff. [3] 

Levinson, N., Appendix B of N. Wiener, 1949, Extrapolation, Interpolation and Smoothing of 
Stationary Time Series (New York: Wiley). [4] 

Robinson, E.A., and Treitel, S. 1980, Geophysical Signal Analysis (Englewood Cliffs, NJ:
Prentice-Hall), pp. 163ff. [5]

Bunch, J.R. 1985, SIAM Journal on Scientific and Statistical Computing, vol. 6, pp. 349-364. [6] 

de Hoog, F. 1987, Linear Algebra and Its Applications, vol. 88/89, pp. 123-138. [7] 


2.9 Cholesky Decomposition 


If a square matrix $\mathbf{A}$ happens to be symmetric and positive definite, then it has a
special, more efficient, triangular decomposition. Symmetric means that $a_{ij} = a_{ji}$ for
$i,j = 1,\ldots,N$, while positive definite means that

$$
\mathbf{v} \cdot \mathbf{A} \cdot \mathbf{v} > 0 \qquad \text{for all nonzero vectors } \mathbf{v}
\tag{2.9.1}
$$


(In Chapter 11 we will see that positive definite has the equivalent interpretation that A has 
all positive eigenvalues.) While symmetric, positive definite matrices are rather special, they 
occur quite frequently in some applications, so their special factorization, called Cholesky 
decomposition, is good to know about. When you can use it, Cholesky decomposition is about 
a factor of two faster than alternative methods for solving linear equations. 

Instead of seeking arbitrary lower and upper triangular factors $\mathbf{L}$ and $\mathbf{U}$, Cholesky
decomposition constructs a lower triangular matrix $\mathbf{L}$ whose transpose $\mathbf{L}^T$ can itself serve as
the upper triangular part. In other words we replace equation (2.3.1) by

$$
\mathbf{L} \cdot \mathbf{L}^T = \mathbf{A}
\tag{2.9.2}
$$

This factorization is sometimes referred to as "taking the square root" of the matrix $\mathbf{A}$. The
components of $\mathbf{L}^T$ are of course related to those of $\mathbf{L}$ by

$$
L^T_{ij} = L_{ji}
\tag{2.9.3}
$$

Writing out equation (2.9.2) in components, one readily obtains the analogs of equations
(2.3.12)-(2.3.13),

$$
L_{ii} = \left( a_{ii} - \sum_{k=1}^{i-1} L_{ik}^2 \right)^{1/2}
\tag{2.9.4}
$$

and

$$
L_{ji} = \frac{1}{L_{ii}} \left( a_{ij} - \sum_{k=1}^{i-1} L_{ik} L_{jk} \right) \qquad j = i + 1, i + 2, \ldots, N
\tag{2.9.5}
$$

If you apply equations (2.9.4) and (2.9.5) in the order $i = 1, 2, \ldots, N$, you will see
that the $L$'s that occur on the right-hand side are already determined by the time they are
needed. Also, only components $a_{ij}$ with $j \ge i$ are referenced. (Since $\mathbf{A}$ is symmetric,
these have complete information.) It is convenient, then, to have the factor $\mathbf{L}$ overwrite the
subdiagonal (lower triangular but not including the diagonal) part of $\mathbf{A}$, preserving the input
upper triangular values of $\mathbf{A}$. Only one extra vector of length $N$ is needed to store the diagonal
part of $\mathbf{L}$. The operations count is $N^3/6$ executions of the inner loop (consisting of one
multiply and one subtract), with also $N$ square roots. As already mentioned, this is about a
factor 2 better than LU decomposition of $\mathbf{A}$ (where its symmetry would be ignored).

A straightforward implementation is 

#include <math.h>

void choldc(float **a, int n, float p[])
/* Given a positive-definite symmetric matrix a[1..n][1..n], this routine constructs its Cholesky
decomposition, A = L · L^T. On input, only the upper triangle of a need be given; it is not
modified. The Cholesky factor L is returned in the lower triangle of a, except for its diagonal
elements which are returned in p[1..n]. */
{
    void nrerror(char error_text[]);
    int i,j,k;
    float sum;

    for (i=1;i<=n;i++) {
        for (j=i;j<=n;j++) {
            for (sum=a[i][j],k=i-1;k>=1;k--) sum -= a[i][k]*a[j][k];
            if (i == j) {
                if (sum <= 0.0)             /* a, with rounding errors, is not positive definite. */
                    nrerror("choldc failed");
                p[i]=sqrt(sum);
            } else a[j][i]=sum/p[i];
        }
    }
}


You might at this point wonder about pivoting. The pleasant answer is that Cholesky 
decomposition is extremely stable numerically, without any pivoting at all. Failure of choldc 
simply indicates that the matrix A (or, with roundoff error, another very nearby matrix) is 
not positive definite. In fact, choldc is an efficient way to test whether a symmetric matrix 
is positive definite. (In this application, you will want to replace the call to nrerror with 
some less drastic signaling method.) 

Once your matrix is decomposed, the triangular factor can be used to solve a linear 
equation by backsubstitution. The straightforward implementation of this is 

void cholsl(float **a, int n, float p[], float b[], float x[])
/* Solves the set of n linear equations A · x = b, where a is a positive-definite symmetric matrix.
a[1..n][1..n] and p[1..n] are input as the output of the routine choldc. Only the lower
subdiagonal portion of a is accessed. b[1..n] is input as the right-hand side vector. The
solution vector is returned in x[1..n]. a, n, and p are not modified and can be left in place
for successive calls with different right-hand sides b. b is not modified unless you identify b and
x in the calling sequence, which is allowed. */
{
    int i,k;
    float sum;

    for (i=1;i<=n;i++) {                    /* Solve L · y = b, storing y in x. */
        for (sum=b[i],k=i-1;k>=1;k--) sum -= a[i][k]*x[k];
        x[i]=sum/p[i];
    }
    for (i=n;i>=1;i--) {                    /* Solve L^T · x = y. */
        for (sum=x[i],k=i+1;k<=n;k++) sum -= a[k][i]*x[k];
        x[i]=sum/p[i];
    }
}


A typical use of choldc and cholsl is in the inversion of covariance matrices describing
the fit of data to a model; see, e.g., §15.6. In this, and many other applications, one often needs
$\mathbf{L}^{-1}$. The lower triangle of this matrix can be efficiently found from the output of choldc:

    for (i=1;i<=n;i++) {
        a[i][i]=1.0/p[i];
        for (j=i+1;j<=n;j++) {
            sum=0.0;
            for (k=i;k<j;k++) sum -= a[j][k]*a[k][i];
            a[j][i]=sum/p[j];
        }
    }


CITED REFERENCES AND FURTHER READING: 

Wilkinson, J.H., and Reinsch, C. 1971, Linear Algebra, vol. II of Handbook for Automatic
Computation (New York: Springer-Verlag), Chapter I/1.

Gill, P.E., Murray, W., and Wright, M.H. 1991, Numerical Linear Algebra and Optimization, vol. 1
(Redwood City, CA: Addison-Wesley), §4.9.2.

Dahlquist, G., and Bjorck, A. 1974, Numerical Methods (Englewood Cliffs, NJ: Prentice-Hall), 
§5.3.5. 

Golub, G.H., and Van Loan, C.F. 1989, Matrix Computations, 2nd ed. (Baltimore: Johns Hopkins 
University Press), §4.2. 


2.10 QR Decomposition 

There is another matrix factorization that is sometimes very useful, the so-called QR
decomposition,

$$
\mathbf{A} = \mathbf{Q} \cdot \mathbf{R}
\tag{2.10.1}
$$

Here $\mathbf{R}$ is upper triangular, while $\mathbf{Q}$ is orthogonal, that is,

$$
\mathbf{Q}^T \cdot \mathbf{Q} = \mathbf{1}
\tag{2.10.2}
$$

where $\mathbf{Q}^T$ is the transpose matrix of $\mathbf{Q}$. Although the decomposition exists for a general
rectangular matrix, we shall restrict our treatment to the case when all the matrices are square,
with dimensions $N \times N$.

Like the other matrix factorizations we have met (LU, SVD, Cholesky), QR decomposition
can be used to solve systems of linear equations. To solve

$$
\mathbf{A} \cdot \mathbf{x} = \mathbf{b}
\tag{2.10.3}
$$

first form $\mathbf{Q}^T \cdot \mathbf{b}$ and then solve

$$
\mathbf{R} \cdot \mathbf{x} = \mathbf{Q}^T \cdot \mathbf{b}
\tag{2.10.4}
$$

by backsubstitution. Since QR decomposition involves about twice as many operations as
LU decomposition, it is not used for typical systems of linear equations. However, we will
meet special cases where QR is the method of choice.

The standard algorithm for the QR decomposition involves successive Householder
transformations (to be discussed later in §11.2). We write a Householder matrix in the form
$\mathbf{1} - \mathbf{u} \otimes \mathbf{u}/c$ where $c = \frac{1}{2}\, \mathbf{u} \cdot \mathbf{u}$. An appropriate Householder matrix applied to a given matrix
can zero all elements in a column of the matrix situated below a chosen element. Thus we
arrange for the first Householder matrix $\mathbf{Q}_1$ to zero all elements in the first column of $\mathbf{A}$ below
the first element. Similarly $\mathbf{Q}_2$ zeroes all elements in the second column below the second
element, and so on up to $\mathbf{Q}_{n-1}$. Thus

$$
\mathbf{R} = \mathbf{Q}_{n-1} \cdots \mathbf{Q}_1 \cdot \mathbf{A}
\tag{2.10.5}
$$

Since the Householder matrices are orthogonal,

$$
\mathbf{Q} = (\mathbf{Q}_{n-1} \cdots \mathbf{Q}_1)^{-1} = \mathbf{Q}_1 \cdots \mathbf{Q}_{n-1}
\tag{2.10.6}
$$

In most applications we don't need to form $\mathbf{Q}$ explicitly; we instead store it in the factored
form (2.10.6). Pivoting is not usually necessary unless the matrix $\mathbf{A}$ is very close to singular.
A general QR algorithm for rectangular matrices including pivoting is given in [1]. For square
matrices, an implementation is the following:



#include <math.h>
#include "nrutil.h"

void qrdcmp(float **a, int n, float *c, float *d, int *sing)
/* Constructs the QR decomposition of a[1..n][1..n]. The upper triangular matrix R is returned
in the upper triangle of a, except for the diagonal elements of R which are returned in
d[1..n]. The orthogonal matrix Q is represented as a product of n - 1 Householder matrices
Q_1 ... Q_{n-1}, where Q_j = 1 - u_j ⊗ u_j/c_j. The ith component of u_j is zero for i = 1,...,j-1
while the nonzero components are returned in a[i][j] for i = j,...,n. sing returns as
true (1) if singularity is encountered during the decomposition, but the decomposition is still
completed in this case; otherwise it returns false (0). */
{
    int i,j,k;
    float scale,sigma,sum,tau;

    *sing=0;
    for (k=1;k<n;k++) {
        scale=0.0;
        for (i=k;i<=n;i++) scale=FMAX(scale,fabs(a[i][k]));
        if (scale == 0.0) {                 /* Singular case. */
            *sing=1;
            c[k]=d[k]=0.0;
        } else {                            /* Form Q_k and Q_k · A. */
            for (i=k;i<=n;i++) a[i][k] /= scale;
            for (sum=0.0,i=k;i<=n;i++) sum += SQR(a[i][k]);
            sigma=SIGN(sqrt(sum),a[k][k]);
            a[k][k] += sigma;
            c[k]=sigma*a[k][k];
            d[k] = -scale*sigma;
            for (j=k+1;j<=n;j++) {
                for (sum=0.0,i=k;i<=n;i++) sum += a[i][k]*a[i][j];
                tau=sum/c[k];
                for (i=k;i<=n;i++) a[i][j] -= tau*a[i][k];
            }
        }
    }
    d[n]=a[n][n];
    if (d[n] == 0.0) *sing=1;
}



The next routine, qrsolv, is used to solve linear systems. In many applications only the 
part (2.10.4) of the algorithm is needed, so we separate it off into its own routine rsolv. 




void qrsolv(float **a, int n, float c[], float d[], float b[])
/* Solves the set of n linear equations A · x = b. a[1..n][1..n], c[1..n], and d[1..n] are
input as the output of the routine qrdcmp and are not modified. b[1..n] is input as the
right-hand side vector, and is overwritten with the solution vector on output. */
{
    void rsolv(float **a, int n, float d[], float b[]);
    int i,j;
    float sum,tau;

    for (j=1;j<n;j++) {                     /* Form Q^T · b. */
        for (sum=0.0,i=j;i<=n;i++) sum += a[i][j]*b[i];
        tau=sum/c[j];
        for (i=j;i<=n;i++) b[i] -= tau*a[i][j];
    }
    rsolv(a,n,d,b);                         /* Solve R · x = Q^T · b. */
}


void rsolv(float **a, int n, float d[], float b[])
/* Solves the set of n linear equations R · x = b, where R is an upper triangular matrix stored in
a and d. a[1..n][1..n] and d[1..n] are input as the output of the routine qrdcmp and
are not modified. b[1..n] is input as the right-hand side vector, and is overwritten with the
solution vector on output. */
{
    int i,j;
    float sum;

    b[n] /= d[n];
    for (i=n-1;i>=1;i--) {
        for (sum=0.0,j=i+1;j<=n;j++) sum += a[i][j]*b[j];
        b[i]=(b[i]-sum)/d[i];
    }
}

See [2] for details on how to use QR decomposition for constructing orthogonal bases, 
and for solving least-squares problems. (We prefer to use SVD, §2.6, for these purposes, 
because of its greater diagnostic capability in pathological cases.) 

Updating a QR decomposition 

Some numerical algorithms involve solving a succession of linear systems each of which
differs only slightly from its predecessor. Instead of doing $O(N^3)$ operations each time
to solve the equations from scratch, one can often update a matrix factorization in $O(N^2)$
operations and use the new factorization to solve the next set of linear equations. The LU
decomposition is complicated to update because of pivoting. However, QR turns out to be
quite simple for a very common kind of update,

$$
\mathbf{A} \to \mathbf{A} + \mathbf{s} \otimes \mathbf{t}
\tag{2.10.7}
$$

(compare equation 2.7.1). In practice it is more convenient to work with the equivalent form

$$
\mathbf{A} = \mathbf{Q} \cdot \mathbf{R} \quad \to \quad \mathbf{A}' = \mathbf{Q}' \cdot \mathbf{R}' = \mathbf{Q} \cdot (\mathbf{R} + \mathbf{u} \otimes \mathbf{v})
\tag{2.10.8}
$$

One can go back and forth between equations (2.10.7) and (2.10.8) using the fact that $\mathbf{Q}$
is orthogonal, giving

$$
\mathbf{t} = \mathbf{v} \quad \text{and either} \quad \mathbf{s} = \mathbf{Q} \cdot \mathbf{u} \quad \text{or} \quad \mathbf{u} = \mathbf{Q}^T \cdot \mathbf{s}
\tag{2.10.9}
$$

The algorithm [2] has two phases. In the first we apply $N - 1$ Jacobi rotations (§11.1) to
reduce $\mathbf{R} + \mathbf{u} \otimes \mathbf{v}$ to upper Hessenberg form. Another $N - 1$ Jacobi rotations transform this
upper Hessenberg matrix to the new upper triangular matrix $\mathbf{R}'$. The matrix $\mathbf{Q}'$ is simply the
product of $\mathbf{Q}$ with the $2(N - 1)$ Jacobi rotations. In applications we usually want $\mathbf{Q}^T$, and
the algorithm can easily be rearranged to work with this matrix instead of with $\mathbf{Q}$.







#include <math.h>
#include "nrutil.h"

void qrupdt(float **r, float **qt, int n, float u[], float v[])
/* Given the QR decomposition of some n x n matrix, calculates the QR decomposition of the
matrix Q · (R + u ⊗ v). The quantities are dimensioned as r[1..n][1..n], qt[1..n][1..n],
u[1..n], and v[1..n]. Note that Q^T is input and returned in qt. */
{
    void rotate(float **r, float **qt, int n, int i, float a, float b);
    int i,j,k;

    for (k=n;k>=1;k--) {                    /* Find largest k such that u[k] != 0. */
        if (u[k]) break;
    }
    if (k < 1) k=1;
    for (i=k-1;i>=1;i--) {                  /* Transform R + u ⊗ v to upper Hessenberg. */
        rotate(r,qt,n,i,u[i],-u[i+1]);
        if (u[i] == 0.0) u[i]=fabs(u[i+1]);
        else if (fabs(u[i]) > fabs(u[i+1]))
            u[i]=fabs(u[i])*sqrt(1.0+SQR(u[i+1]/u[i]));
        else u[i]=fabs(u[i+1])*sqrt(1.0+SQR(u[i]/u[i+1]));
    }
    for (j=1;j<=n;j++) r[1][j] += u[1]*v[j];
    for (i=1;i<k;i++)                       /* Transform upper Hessenberg matrix to upper triangular. */
        rotate(r,qt,n,i,r[i][i],-r[i+1][i]);
}


#include <math.h>
#include "nrutil.h"

void rotate(float **r, float **qt, int n, int i, float a, float b)
/* Given matrices r[1..n][1..n] and qt[1..n][1..n], carry out a Jacobi rotation on rows
i and i + 1 of each matrix. a and b are the parameters of the rotation: cos θ = a/sqrt(a^2 + b^2),
sin θ = b/sqrt(a^2 + b^2). */
{
    int j;
    float c,fact,s,w,y;

    if (a == 0.0) {                         /* Avoid unnecessary overflow or underflow. */
        c=0.0;
        s=(b >= 0.0 ? 1.0 : -1.0);
    } else if (fabs(a) > fabs(b)) {
        fact=b/a;
        c=SIGN(1.0/sqrt(1.0+(fact*fact)),a);
        s=fact*c;
    } else {
        fact=a/b;
        s=SIGN(1.0/sqrt(1.0+(fact*fact)),b);
        c=fact*s;
    }
    for (j=i;j<=n;j++) {                    /* Premultiply r by Jacobi rotation. */
        y=r[i][j];
        w=r[i+1][j];
        r[i][j]=c*y-s*w;
        r[i+1][j]=s*y+c*w;
    }
    for (j=1;j<=n;j++) {                    /* Premultiply qt by Jacobi rotation. */
        y=qt[i][j];
        w=qt[i+1][j];
        qt[i][j]=c*y-s*w;
        qt[i+1][j]=s*y+c*w;
    }
}

We will make use of QR decomposition, and its updating, in §9.7.


CITED REFERENCES AND FURTHER READING: 

Wilkinson, J.H., and Reinsch, C. 1971, Linear Algebra, vol. II of Handbook for Automatic
Computation (New York: Springer-Verlag), Chapter I/8. [1]

Golub, G.H., and Van Loan, C.F. 1989, Matrix Computations, 2nd ed. (Baltimore: Johns Hopkins 
University Press), §§5.2, 5.3, 12.6. [2] 


2.11 Is Matrix Inversion an $N^3$ Process?

We close this chapter with a little entertainment, a bit of algorithmic prestidigitation
which probes more deeply into the subject of matrix inversion. We start
with a seemingly simple question:

How many individual multiplications does it take to perform the matrix multiplication
of two $2 \times 2$ matrices,

$$
\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}
\cdot
\begin{pmatrix} b_{11} & b_{12} \\ b_{21} & b_{22} \end{pmatrix}
=
\begin{pmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{pmatrix}
\tag{2.11.1}
$$

Eight, right? Here they are written explicitly:

$$
\begin{aligned}
c_{11} &= a_{11} \times b_{11} + a_{12} \times b_{21} \\
c_{12} &= a_{11} \times b_{12} + a_{12} \times b_{22} \\
c_{21} &= a_{21} \times b_{11} + a_{22} \times b_{21} \\
c_{22} &= a_{21} \times b_{12} + a_{22} \times b_{22}
\end{aligned}
\tag{2.11.2}
$$

Do you think that one can write formulas for the $c$'s that involve only seven
multiplications? (Try it yourself, before reading on.)

Such a set of formulas was, in fact, discovered by Strassen [1]. The formulas are:

$$
\begin{aligned}
Q_1 &\equiv (a_{11} + a_{22}) \times (b_{11} + b_{22}) \\
Q_2 &\equiv (a_{21} + a_{22}) \times b_{11} \\
Q_3 &\equiv a_{11} \times (b_{12} - b_{22}) \\
Q_4 &\equiv a_{22} \times (-b_{11} + b_{21}) \\
Q_5 &\equiv (a_{11} + a_{12}) \times b_{22} \\
Q_6 &\equiv (-a_{11} + a_{21}) \times (b_{11} + b_{12}) \\
Q_7 &\equiv (a_{12} - a_{22}) \times (b_{21} + b_{22})
\end{aligned}
\tag{2.11.3}
$$

in terms of which

$$
\begin{aligned}
c_{11} &= Q_1 + Q_4 - Q_5 + Q_7 \\
c_{21} &= Q_2 + Q_4 \\
c_{12} &= Q_3 + Q_5 \\
c_{22} &= Q_1 + Q_3 - Q_2 + Q_6
\end{aligned}
\tag{2.11.4}
$$


What’s the use of this? There is one fewer multiplication than in equation
(2.11.2), but many more additions and subtractions. It is not clear that anything
has been gained. But notice that in (2.11.3) the a’s and b’s are never commuted.
Therefore (2.11.3) and (2.11.4) are valid when the a’s and b’s are themselves matrices.
The problem of multiplying two very large matrices (of order $N = 2^m$ for some
integer $m$) can now be broken down recursively by partitioning the matrices into
quarters, sixteenths, etc. And note the key point: The savings is not just a factor
“7/8”; it is that factor at each hierarchical level of the recursion. In total it reduces
the process of matrix multiplication to order $N^{\log_2 7}$ instead of $N^3$.

What about all the extra additions in (2.11.3)–(2.11.4)? Don’t they outweigh
the advantage of the fewer multiplications? For large $N$, it turns out that there are
six times as many additions as multiplications implied by (2.11.3)–(2.11.4). But,
if $N$ is very large, this constant factor is no match for the change in the exponent
from $N^3$ to $N^{\log_2 7}$.
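
To make the seven-product scheme concrete, here is a minimal sketch in C of equations (2.11.3) and (2.11.4) for scalar entries. The function name strassen2x2, the use of double, and the zero-offset array layout are assumptions made for this illustration; they are not routines from this book. A recursive version would apply the same formulas with submatrices in place of scalars.

```c
/* Multiply two 2x2 matrices using the seven products of (2.11.3) and the
   combinations of (2.11.4). Entries are scalars here; the same formulas
   hold for submatrices because the a's and b's are never commuted.
   Index [r][c] holds the element in row r+1, column c+1. */
void strassen2x2(double a[2][2], double b[2][2], double c[2][2])
{
    double q1 = (a[0][0] + a[1][1]) * (b[0][0] + b[1][1]);
    double q2 = (a[1][0] + a[1][1]) * b[0][0];
    double q3 = a[0][0] * (b[0][1] - b[1][1]);
    double q4 = a[1][1] * (-b[0][0] + b[1][0]);
    double q5 = (a[0][0] + a[0][1]) * b[1][1];
    double q6 = (-a[0][0] + a[1][0]) * (b[0][0] + b[0][1]);
    double q7 = (a[0][1] - a[1][1]) * (b[1][0] + b[1][1]);

    c[0][0] = q1 + q4 - q5 + q7;   /* c11 */
    c[1][0] = q2 + q4;             /* c21 */
    c[0][1] = q3 + q5;             /* c12 */
    c[1][1] = q1 + q3 - q2 + q6;   /* c22 */
}
```

For example, with a = {{1, 2}, {3, 4}} and b = {{5, 6}, {7, 8}} the seven products are 65, 35, -2, 8, 24, 22, and -30, and the result is {{19, 22}, {43, 50}}, the same as the eight-multiplication formula (2.11.2).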

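With this “fast” matrix multiplication, Strassen also obtained a surprising result
for matrix inversion [1]. Suppose that the matrices

$$
\begin{pmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{pmatrix}
\quad \text{and} \quad
\begin{pmatrix} c_{11} & c_{12} \\ c_{21} & c_{22} \end{pmatrix}
\tag{2.11.5}
$$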


are inverses of each other. Then the c’s can be obtained from the a’s by the following 
operations (compare equations 2.7.22 and 2.7.25): 


$$
\begin{aligned}
R_1 &= \text{Inverse}(a_{11}) \\
R_2 &= a_{21} \times R_1 \\
R_3 &= R_1 \times a_{12} \\
R_4 &= a_{21} \times R_3 \\
R_5 &= R_4 - a_{22} \\
R_6 &= \text{Inverse}(R_5) \\
c_{12} &= R_3 \times R_6 \\
c_{21} &= R_6 \times R_2 \\
R_7 &= R_3 \times c_{21} \\
c_{11} &= R_1 - R_7 \\
c_{22} &= -R_6
\end{aligned}
\tag{2.11.6}
$$

In (2.11.6) the “inverse” operator occurs just twice. It is to be interpreted as the
reciprocal if the a’s and c’s are scalars, but as matrix inversion if the a’s and c’s are
themselves submatrices. Imagine doing the inversion of a very large matrix, of order
$N = 2^m$, recursively by partitions in half. At each step, halving the order doubles
the number of inverse operations. But this means that there are only $N$ divisions in
all! So divisions don’t dominate in the recursive use of (2.11.6). Equation (2.11.6)
is dominated, in fact, by its 6 multiplications. Since these can be done by an $N^{\log_2 7}$
algorithm, so can the matrix inversion!
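
The sequence (2.11.6) can be sketched in C for the scalar case, where "Inverse" is just a reciprocal. The name invert2x2, the return codes, and the zero-offset layout are assumptions of this sketch. It does no pivoting, so it simply reports failure when $a_{11}$ or $R_5$ vanishes.

```c
/* Invert a 2x2 matrix by the operation sequence (2.11.6), with scalar
   entries. Returns 0 on success, -1 if a11 or the intermediate R5 is zero
   (this sketch has no pivoting). Index [r][c] is row r+1, column c+1. */
int invert2x2(double a[2][2], double c[2][2])
{
    double r1, r2, r3, r4, r5, r6, r7;

    if (a[0][0] == 0.0) return -1;
    r1 = 1.0 / a[0][0];      /* R1 = Inverse(a11) */
    r2 = a[1][0] * r1;       /* R2 = a21 x R1 */
    r3 = r1 * a[0][1];       /* R3 = R1 x a12 */
    r4 = a[1][0] * r3;       /* R4 = a21 x R3 */
    r5 = r4 - a[1][1];       /* R5 = R4 - a22 */
    if (r5 == 0.0) return -1;
    r6 = 1.0 / r5;           /* R6 = Inverse(R5) */
    c[0][1] = r3 * r6;       /* c12 = R3 x R6 */
    c[1][0] = r6 * r2;       /* c21 = R6 x R2 */
    r7 = r3 * c[1][0];       /* R7 = R3 x c21 */
    c[0][0] = r1 - r7;       /* c11 = R1 - R7 */
    c[1][1] = -r6;           /* c22 = -R6 */
    return 0;
}
```

Counting the operations in the body confirms the tally in the text: two reciprocals and six multiplications. In a recursive version each reciprocal would become a half-size inversion and each product a half-size matrix multiplication.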

This is fun, but let’s look at practicalities: If you estimate how large $N$ has to be
before the difference between exponent 3 and exponent $\log_2 7 = 2.807$ is substantial
enough to outweigh the bookkeeping overhead, arising from the complicated nature
of the recursive Strassen algorithm, you will find that LU decomposition is in no
immediate danger of becoming obsolete.

If, on the other hand, you like this kind of fun, then try these: (1) Can you
multiply the complex numbers $(a + ib)$ and $(c + id)$ in only three real multiplications?
[Answer: see §5.4.] (2) Can you evaluate a general fourth-degree polynomial in
$x$ for many different values of $x$ with only three multiplications per evaluation?
[Answer: see §5.3.]

CITED REFERENCES AND FURTHER READING: 

Strassen, V. 1969, Numerische Mathematik, vol. 13, pp. 354-356. [1]

Kronsjö, L. 1987, Algorithms: Their Complexity and Efficiency, 2nd ed. (New York: Wiley).
Winograd, S. 1971, Linear Algebra and Its Applications, vol. 4, pp. 381-388. 

Pan, V. Ya. 1980, SIAM Journal on Computing, vol. 9, pp. 321-342. 

Pan, V. 1984, How to Multiply Matrices Faster, Lecture Notes in Computer Science, vol. 179 
(New York: Springer-Verlag) 

Pan, V. 1984, SIAM Review, vol. 26, pp. 393-415. [More recent results that show that an exponent 
of 2.496 can be achieved — theoretically!] 
Chapter 3. Interpolation and Extrapolation

3.0 Introduction 


We sometimes know the value of a function $f(x)$ at a set of points $x_1, x_2, \ldots, x_N$
(say, with $x_1 < \ldots < x_N$), but we don’t have an analytic expression for $f(x)$ that lets
us calculate its value at an arbitrary point. For example, the $f(x_i)$’s might result from
some physical measurement or from long numerical calculation that cannot be cast
into a simple functional form. Often the $x_i$’s are equally spaced, but not necessarily.

The task now is to estimate $f(x)$ for arbitrary $x$ by, in some sense, drawing a
smooth curve through (and perhaps beyond) the $x_i$. If the desired $x$ is in between the
largest and smallest of the $x_i$’s, the problem is called interpolation; if $x$ is outside
that range, it is called extrapolation, which is considerably more hazardous (as many
former stock-market analysts can attest).

Interpolation and extrapolation schemes must model the function, between or 
beyond the known points, by some plausible functional form. The form should 
be sufficiently general so as to be able to approximate large classes of functions 
which might arise in practice. By far most common among the functional forms 
used are polynomials (§3.1). Rational functions (quotients of polynomials) also turn 
out to be extremely useful (§3.2). Trigonometric functions, sines and cosines, give 
rise to trigonometric interpolation and related Fourier methods, which we defer to 
Chapters 12 and 13. 

There is an extensive mathematical literature devoted to theorems about what 
sort of functions can be well approximated by which interpolating functions. These 
theorems are, alas, almost completely useless in day-to-day work: If we know 
enough about our function to apply a theorem of any power, we are usually not in 
the pitiful state of having to interpolate on a table of its values! 

Interpolation is related to, but distinct from, function approximation. That task
consists of finding an approximate (but easily computable) function to use in place
of a more complicated one. In the case of interpolation, you are given the function $f$
at points not of your own choosing. For the case of function approximation, you are
allowed to compute the function $f$ at any desired points for the purpose of developing
your approximation. We deal with function approximation in Chapter 5.

One can easily find pathological functions that make a mockery of any interpolation
scheme. Consider, for example, the function

$$
f(x) = 3x^2 + \frac{1}{\pi^4} \ln\left[(\pi - x)^2\right] + 1
\tag{3.0.1}
$$

which is well-behaved everywhere except at $x = \pi$, very mildly singular at $x = \pi$,
and otherwise takes on all positive and negative values. Any interpolation based on
the values $x = 3.13, 3.14, 3.15, 3.16$, will assuredly get a very wrong answer for
the value $x = 3.1416$, even though a graph plotting those five points looks really
quite smooth! (Try it on your calculator.)

Because pathologies can lurk anywhere, it is highly desirable that an interpolation
and extrapolation routine should provide an estimate of its own error. Such
an error estimate can never be foolproof, of course. We could have a function that, 
for reasons known only to its maker, takes off wildly and unexpectedly between 
two tabulated points. Interpolation always presumes some degree of smoothness 
for the function interpolated, but within this framework of presumption, deviations 
from smoothness can be detected. 

Conceptually, the interpolation process has two stages: (1) Fit an interpolating 
function to the data points provided. (2) Evaluate that interpolating function at 
the target point x. 

However, this two-stage method is generally not the best way to proceed in
practice. Typically it is computationally less efficient, and more susceptible to
roundoff error, than methods which construct a functional estimate $f(x)$ directly
from the $N$ tabulated values every time one is desired. Most practical schemes start
at a nearby point $f(x_i)$, then add a sequence of (hopefully) decreasing corrections,
as information from other $f(x_i)$’s is incorporated. The procedure typically takes
$O(N^2)$ operations. If everything is well behaved, the last correction will be the
smallest, and it can be used as an informal (though not rigorous) bound on the error.

In the case of polynomial interpolation, it sometimes does happen that the 
coefficients of the interpolating polynomial are of interest, even though their use 
in evaluating the interpolating function should be frowned on. We deal with this 
eventuality in §3.5. 

Local interpolation, using a finite number of “nearest-neighbor” points, gives
interpolated values $f(x)$ that do not, in general, have continuous first or higher
derivatives. That happens because, as $x$ crosses the tabulated values $x_i$, the
interpolation scheme switches which tabulated points are the “local” ones. (If such
a switch is allowed to occur anywhere else, then there will be a discontinuity in the
interpolated function itself at that point. Bad idea!)

In situations where continuity of derivatives is a concern, one must use 
the “stiffer” interpolation provided by a so-called spline function. A spline is 
a polynomial between each pair of table points, but one whose coefficients are 
determined “slightly” nonlocally. The nonlocality is designed to guarantee global 
smoothness in the interpolated function up to some order of derivative. Cubic splines 
(§3.3) are the most popular. They produce an interpolated function that is continuous 
through the second derivative. Splines tend to be stabler than polynomials, with less 
possibility of wild oscillation between the tabulated points. 

The number of points (minus one) used in an interpolation scheme is called 
the order of the interpolation. Increasing the order does not necessarily increase 
the accuracy, especially in polynomial interpolation. If the added points are distant 
from the point of interest x, the resulting higher-order polynomial, with its additional 
constrained points, tends to oscillate wildly between the tabulated values. This 
oscillation may have no relation at all to the behavior of the “true” function (see 
Figure 3.0.1). Of course, adding points close to the desired point usually does help,
but a finer mesh implies a larger table of values, not always available.

Figure 3.0.1. (a) A smooth function (solid line) is more accurately interpolated by a high-order
polynomial (shown schematically as dotted line) than by a low-order polynomial (shown as a piecewise
linear dashed line). (b) A function with sharp corners or rapidly changing higher derivatives is less
accurately approximated by a high-order polynomial (dotted line), which is too “stiff,” than by a low-order
polynomial (dashed lines). Even some smooth functions, such as exponentials or rational functions, can
be badly approximated by high-order polynomials.

Unless there is solid evidence that the interpolating function is close in form to
the true function $f$, it is a good idea to be cautious about high-order interpolation.
We enthusiastically endorse interpolations with 3 or 4 points; we are perhaps tolerant
of 5 or 6; but we rarely go higher than that unless there is quite rigorous monitoring
of estimated errors.

When your table of values contains many more points than the desirable order 
of interpolation, you must begin each interpolation with a search for the right “local” 
place in the table. While not strictly a part of the subject of interpolation, this task is 
important enough (and often enough botched) that we devote §3.4 to its discussion. 

The routines given for interpolation are also routines for extrapolation. An 
important application, in Chapter 16, is their use in the integration of ordinary 
differential equations. There, considerable care is taken with the monitoring of 
errors. Otherwise, the dangers of extrapolation cannot be overemphasized: An 
interpolating function, which is perforce an extrapolating function, will typically go 
berserk when the argument x is outside the range of tabulated values by more than 
the typical spacing of tabulated points. 

Interpolation can be done in more than one dimension, e.g., for a function
$f(x, y, z)$. Multidimensional interpolation is often accomplished by a sequence of
one-dimensional interpolations. We discuss this in §3.6.


CITED REFERENCES AND FURTHER READING: 

Abramowitz, M., and Stegun, I.A. 1964, Handbook of Mathematical Functions, Applied Mathematics
Series, Volume 55 (Washington: National Bureau of Standards; reprinted 1968 by
Dover Publications, New York), §25.2.

Stoer, J., and Bulirsch, R. 1980, Introduction to Numerical Analysis (New York: Springer-Verlag),
Chapter 2.

Acton, F.S. 1970, Numerical Methods That Work; 1990, corrected edition (Washington: Mathematical
Association of America), Chapter 3.

Kahaner, D., Moler, C., and Nash, S. 1989, Numerical Methods and Software (Englewood Cliffs,
NJ: Prentice Hall), Chapter 4.

Johnson, L.W., and Riess, R.D. 1982, Numerical Analysis, 2nd ed. (Reading, MA: Addison-Wesley),
Chapter 5.

Ralston, A., and Rabinowitz, P. 1978, A First Course in Numerical Analysis, 2nd ed. (New York: 
McGraw-Hill), Chapter 3. 

Isaacson, E., and Keller, H.B. 1966, Analysis of Numerical Methods (New York: Wiley), Chapter 6. 


3.1 Polynomial Interpolation and Extrapolation 

Through any two points there is a unique line. Through any three points, a
unique quadratic. Et cetera. The interpolating polynomial of degree $N - 1$ through
the $N$ points $y_1 = f(x_1), y_2 = f(x_2), \ldots, y_N = f(x_N)$ is given explicitly by
Lagrange’s classical formula,

$$
\begin{aligned}
P(x) ={} & \frac{(x - x_2)(x - x_3) \cdots (x - x_N)}{(x_1 - x_2)(x_1 - x_3) \cdots (x_1 - x_N)}\, y_1
 + \frac{(x - x_1)(x - x_3) \cdots (x - x_N)}{(x_2 - x_1)(x_2 - x_3) \cdots (x_2 - x_N)}\, y_2 \\
 & + \cdots + \frac{(x - x_1)(x - x_2) \cdots (x - x_{N-1})}{(x_N - x_1)(x_N - x_2) \cdots (x_N - x_{N-1})}\, y_N
\end{aligned}
\tag{3.1.1}
$$

There are $N$ terms, each a polynomial of degree $N - 1$ and each constructed to be
zero at all of the $x_i$ except one, at which it is constructed to be $y_i$.

It is not terribly wrong to implement the Lagrange formula straightforwardly,
but it is not terribly right either. The resulting algorithm gives no error estimate, and
it is also somewhat awkward to program. A much better algorithm (for constructing
the same, unique, interpolating polynomial) is Neville’s algorithm, closely related to
and sometimes confused with Aitken’s algorithm, the latter now considered obsolete.
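
For comparison with what follows, a straightforward evaluation of (3.1.1) might look like the sketch below. The name lagrange_eval, the use of double, and the zero-offset arrays xa[0..n-1] and ya[0..n-1] are assumptions of this illustration, not conventions of the routines in this chapter. Note that it returns a value but no indication of its own error.

```c
/* Direct evaluation of Lagrange's formula (3.1.1) at x for n tabulated
   points. Assumes zero-offset arrays and distinct xa's. Costs O(n^2)
   operations per evaluation and yields no error estimate. */
double lagrange_eval(const double xa[], const double ya[], int n, double x)
{
    double sum = 0.0;
    int i, j;

    for (i = 0; i < n; i++) {
        double term = ya[i];          /* builds the i-th term of (3.1.1) */
        for (j = 0; j < n; j++)
            if (j != i) term *= (x - xa[j]) / (xa[i] - xa[j]);
        sum += term;
    }
    return sum;
}
```

With the three points (0, 1), (1, 3), (2, 11), which lie on $3x^2 - x + 1$, the three terms at $x = 1.5$ are $-0.125$, $2.25$, and $4.125$, summing to 6.25.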

Let $P_1$ be the value at $x$ of the unique polynomial of degree zero (i.e.,
a constant) passing through the point $(x_1, y_1)$; so $P_1 = y_1$. Likewise define
$P_2, P_3, \ldots, P_N$. Now let $P_{12}$ be the value at $x$ of the unique polynomial of
degree one passing through both $(x_1, y_1)$ and $(x_2, y_2)$. Likewise $P_{23}, P_{34}, \ldots,
P_{(N-1)N}$. Similarly, for higher-order polynomials, up to $P_{123 \ldots N}$, which is the value
of the unique interpolating polynomial through all $N$ points, i.e., the desired answer.







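The various P’s form a “tableau” with “ancestors” on the left leading to a single
“descendant” at the extreme right. For example, with $N = 4$,

$$
\begin{array}{lllll}
x_1: & y_1 = P_1 & & & \\
 & & P_{12} & & \\
x_2: & y_2 = P_2 & & P_{123} & \\
 & & P_{23} & & P_{1234} \\
x_3: & y_3 = P_3 & & P_{234} & \\
 & & P_{34} & & \\
x_4: & y_4 = P_4 & & &
\end{array}
\tag{3.1.2}
$$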





Neville’s algorithm is a recursive way of filling in the numbers in the tableau 
a column at a time, from left to right. It is based on the relationship between a 
“daughter” P and its two “parents,” 


$$
P_{i(i+1) \ldots (i+m)} =
\frac{(x - x_{i+m})\, P_{i(i+1) \ldots (i+m-1)} + (x_i - x)\, P_{(i+1)(i+2) \ldots (i+m)}}{x_i - x_{i+m}}
\tag{3.1.3}
$$

This recurrence works because the two parents already agree at points $x_{i+1} \ldots
x_{i+m-1}$.
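
As a sketch of how the recurrence (3.1.3) fills the tableau a column at a time, the following C function overwrites a single work array in place. The name neville, the ok flag, the use of double, and the zero-offset arrays are assumptions of this illustration; unlike the routine given later in this section, it returns no error estimate.

```c
#include <stdlib.h>

/* Neville's recurrence (3.1.3). On entry to column m, p[i] holds the value
   of the polynomial through points i..i+m-1; on exit it holds the value of
   the one through points i..i+m. Assumes zero-offset arrays with distinct
   xa's. Sets *ok to 1 on success, 0 if allocation fails. */
double neville(const double xa[], const double ya[], int n, double x, int *ok)
{
    double *p = malloc((size_t)n * sizeof *p);
    double result;
    int i, m;

    *ok = 0;
    if (p == NULL) return 0.0;
    for (i = 0; i < n; i++) p[i] = ya[i];      /* leftmost column */
    for (m = 1; m < n; m++)
        for (i = 0; i < n - m; i++)            /* p[i+1] is still the old value */
            p[i] = ((x - xa[i + m]) * p[i] + (xa[i] - x) * p[i + 1])
                   / (xa[i] - xa[i + m]);
    result = p[0];                             /* the rightmost descendant */
    free(p);
    *ok = 1;
    return result;
}
```

For the points (0, 1), (1, 3), (2, 11) and $x = 1.5$, the first column update gives 4 and 7, and the second gives 6.25, which agrees with the quadratic $3x^2 - x + 1$ through those points.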

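An improvement on the recurrence (3.1.3) is to keep track of the small
differences between parents and daughters, namely to define (for $m = 1, 2, \ldots,
N - 1$),

$$
\begin{aligned}
C_{m,i} &\equiv P_{i \ldots (i+m)} - P_{i \ldots (i+m-1)} \\
D_{m,i} &\equiv P_{i \ldots (i+m)} - P_{(i+1) \ldots (i+m)}
\end{aligned}
\tag{3.1.4}
$$

Then one can easily derive from (3.1.3) the relations

$$
\begin{aligned}
D_{m+1,i} &= \frac{(x_{i+m+1} - x)(C_{m,i+1} - D_{m,i})}{x_i - x_{i+m+1}} \\
C_{m+1,i} &= \frac{(x_i - x)(C_{m,i+1} - D_{m,i})}{x_i - x_{i+m+1}}
\end{aligned}
\tag{3.1.5}
$$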


At each level $m$, the C’s and D’s are the corrections that make the interpolation one
order higher. The final answer $P_{1 \ldots N}$ is equal to the sum of any $y_i$ plus a set of C’s
and/or D’s that form a path through the family tree to the rightmost daughter.
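
The C and D bookkeeping of (3.1.5) can be sketched for zero-offset arrays as follows. The name poly_interp, the integer return codes, and the use of double are assumptions of this illustration; the logic follows the description above, starting from the tabulated value nearest to $x$ and adding one correction per column.

```c
#include <math.h>
#include <stdlib.h>

/* Polynomial interpolation by the C/D recurrences (3.1.5), zero-offset.
   Returns 0 on success, -1 on allocation failure, -2 if two xa's coincide.
   On success *y is the interpolated value and *dy the last correction. */
int poly_interp(const double xa[], const double ya[], int n, double x,
                double *y, double *dy)
{
    int i, m, ns = 0, status = 0;
    double den, dif, dift, ho, hp, w;
    double *c = malloc((size_t)n * sizeof *c);
    double *d = malloc((size_t)n * sizeof *d);

    if (c == NULL || d == NULL) { free(c); free(d); return -1; }
    dif = fabs(x - xa[0]);
    for (i = 0; i < n; i++) {          /* find the closest table entry */
        if ((dift = fabs(x - xa[i])) < dif) { ns = i; dif = dift; }
        c[i] = ya[i];
        d[i] = ya[i];
    }
    *y = ya[ns--];                     /* initial approximation */
    *dy = 0.0;
    for (m = 1; m < n && status == 0; m++) {
        for (i = 0; i < n - m; i++) {
            ho = xa[i] - x;
            hp = xa[i + m] - x;
            w = c[i + 1] - d[i];
            den = ho - hp;
            if (den == 0.0) { status = -2; break; }
            den = w / den;
            d[i] = hp * den;
            c[i] = ho * den;
        }
        if (status == 0)               /* fork up or down through the tableau */
            *y += (*dy = (2 * (ns + 1) < (n - m) ? c[ns + 1] : d[ns--]));
    }
    free(c);
    free(d);
    return status;
}
```

With the points (0, 1), (1, 3), (2, 11) and $x = 1.5$, the path starts at $y = 3$, adds $D = 1$ to reach the linear value 4, and then adds $C = 2.25$ to reach 6.25, so the reported dy is 2.25.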

Here is a routine for polynomial interpolation or extrapolation from N input 
points. Note that the input arrays are assumed to be unit-offset. If you have 
zero-offset arrays, remember to subtract 1 (see §1.2): 


#include <math.h>
#include "nrutil.h"

void polint(float xa[], float ya[], int n, float x, float *y, float *dy)
/* Given arrays xa[1..n] and ya[1..n], and given a value x, this routine returns a value y, and
an error estimate dy. If P(x) is the polynomial of degree N - 1 such that P(xa_i) = ya_i,
i = 1,...,n, then the returned value y = P(x). */
{
    int i,m,ns=1;
    float den,dif,dift,ho,hp,w;
    float *c,*d;

    dif=fabs(x-xa[1]);
    c=vector(1,n);
    d=vector(1,n);
    for (i=1;i<=n;i++) {    /* Here we find the index ns of the closest table entry, */
        if ( (dift=fabs(x-xa[i])) < dif) {
            ns=i;
            dif=dift;
        }
        c[i]=ya[i];         /* and initialize the tableau of c's and d's. */
        d[i]=ya[i];
    }
    *y=ya[ns--];            /* This is the initial approximation to y. */
    for (m=1;m<n;m++) {     /* For each column of the tableau, */
        for (i=1;i<=n-m;i++) {  /* we loop over the current c's and d's and update them. */
            ho=xa[i]-x;
            hp=xa[i+m]-x;
            w=c[i+1]-d[i];
            if ( (den=ho-hp) == 0.0) nrerror("Error in routine polint");
            /* This error can occur only if two input xa's are (to within roundoff) identical. */
            den=w/den;
            d[i]=hp*den;    /* Here the c's and d's are updated. */
            c[i]=ho*den;
        }
        *y += (*dy=(2*ns < (n-m) ? c[ns+1] : d[ns--]));
        /* After each column in the tableau is completed, we decide which correction, c or d,
        we want to add to our accumulating value of y, i.e., which path to take through the
        tableau, forking up or down. We do this in such a way as to take the most "straight
        line" route through the tableau to its apex, updating ns accordingly to keep track of
        where we are. This route keeps the partial approximations centered (insofar as possible)
        on the target x. The last dy added is thus the error indication. */
    }
    free_vector(d,1,n);
    free_vector(c,1,n);
}


Quite often you will want to call polint with the dummy arguments xa
and ya replaced by actual arrays with offsets. For example, the construction
polint(&xx[14],&yy[14],4,x,y,dy) performs 4-point interpolation on the tabulated
values xx[15..18], yy[15..18]. For more on this, see the end of §3.4.
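
The offset construction works because a unit-offset routine only ever touches elements 1 through n of whatever pointer it receives. A toy routine makes the aliasing visible; first_plus_last is a made-up name for this illustration and stands in for any routine that, like polint, reads a[1..n].

```c
/* A toy unit-offset routine: it reads a[1] and a[n], exactly the first and
   last elements that a routine such as polint would treat as its table. */
float first_plus_last(float a[], int n)
{
    return a[1] + a[n];
}
```

If xx[i] holds the value i for i = 0,...,19, then first_plus_last(&xx[14], 4) sees a[1] as xx[15] and a[4] as xx[18], and so returns 33.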


CITED REFERENCES AND FURTHER READING: 

Abramowitz, M., and Stegun, I.A. 1964, Handbook of Mathematical Functions, Applied Mathematics
Series, Volume 55 (Washington: National Bureau of Standards; reprinted 1968 by
Dover Publications, New York), §25.2.

Stoer, J., and Bulirsch, R. 1980, Introduction to Numerical Analysis (New York: Springer-Verlag),
§2.1.

Gear, C.W. 1971, Numerical Initial Value Problems in Ordinary Differential Equations (Englewood 
Cliffs, NJ: Prentice-Hall), §6.1. 
3.2 Rational Function Interpolation and Extrapolation


Some functions are not well approximated by polynomials, but are well
approximated by rational functions, that is, quotients of polynomials. We denote
by $R_{i(i+1) \ldots (i+m)}$ a rational function passing through the $m + 1$ points
$(x_i, y_i) \ldots (x_{i+m}, y_{i+m})$. More explicitly, suppose

$$
R_{i(i+1) \ldots (i+m)} = \frac{P_\mu(x)}{Q_\nu(x)}
 = \frac{p_0 + p_1 x + \cdots + p_\mu x^\mu}{q_0 + q_1 x + \cdots + q_\nu x^\nu}
\tag{3.2.1}
$$

Since there are $\mu + \nu + 1$ unknown p’s and q’s ($q_0$ being arbitrary), we must have

$$
m + 1 = \mu + \nu + 1
\tag{3.2.2}
$$


In specifying a rational function interpolating function, you must give the desired 
order of both the numerator and the denominator. 

Rational functions are sometimes superior to polynomials, roughly speaking, 
because of their ability to model functions with poles, that is, zeros of the denominator 
of equation (3.2.1). These poles might occur for real values of x, if the function 
to be interpolated itself has poles. More often, the function f(x) is finite for all 
finite real x, but has an analytic continuation with poles in the complex x-plane. 
Such poles can themselves ruin a polynomial approximation, even one restricted to 
real values of x, just as they can ruin the convergence of an infinite power series 
in x. If you draw a circle in the complex plane around your m tabulated points, 
then you should not expect polynomial interpolation to be good unless the nearest 
pole is rather far outside the circle. A rational function approximation, by contrast, 
will stay “good” as long as it has enough powers of x in its denominator to account 
for (cancel) any nearby poles. 

For the interpolation problem, a rational function is constructed so as to go 
through a chosen set of tabulated functional values. However, we should also 
mention in passing that rational function approximations can be used in analytic 
work. One sometimes constructs a rational function approximation by the criterion 
that the rational function of equation (3.2.1) itself have a power series expansion 
that agrees with the first m + 1 terms of the power series expansion of the desired 
function $f(x)$. This is called Padé approximation, and is discussed in §5.12.


Bulirsch and Stoer found an algorithm of the Neville type which performs 
rational function extrapolation on tabulated data. A tableau like that of equation 
(3.1.2) is constructed column by column, leading to a result and an error estimate. 
The Bulirsch-Stoer algorithm produces the so-called diagonal rational function, with 
the degrees of numerator and denominator equal (if m is even) or with the degree 
of the denominator larger by one (if m is odd, cf. equation 3.2.2 above). For the 
derivation of the algorithm, refer to [1]. The algorithm is summarized by a recurrence
relation exactly analogous to equation (3.1.3) for polynomial approximation:

$$
R_{i(i+1) \ldots (i+m)} = R_{(i+1) \ldots (i+m)}
 + \frac{R_{(i+1) \ldots (i+m)} - R_{i \ldots (i+m-1)}}
 {\left(\dfrac{x - x_i}{x - x_{i+m}}\right)
  \left(1 - \dfrac{R_{(i+1) \ldots (i+m)} - R_{i \ldots (i+m-1)}}{R_{(i+1) \ldots (i+m)} - R_{(i+1) \ldots (i+m-1)}}\right) - 1}
\tag{3.2.3}
$$


This recurrence generates the rational functions through $m + 1$ points from the ones
through $m$ and (the term $R_{(i+1) \ldots (i+m-1)}$ in equation 3.2.3) $m - 1$ points. It is started
with

$$
R_i = y_i
\tag{3.2.4}
$$

and with

$$
R \equiv [R_{i(i+1) \ldots (i+m)} \text{ with } m = -1] = 0
\tag{3.2.5}
$$


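Now, exactly as in equations (3.1.4) and (3.1.5) above, we can convert the
recurrence (3.2.3) to one involving only the small differences

$$
\begin{aligned}
C_{m,i} &\equiv R_{i \ldots (i+m)} - R_{i \ldots (i+m-1)} \\
D_{m,i} &\equiv R_{i \ldots (i+m)} - R_{(i+1) \ldots (i+m)}
\end{aligned}
\tag{3.2.6}
$$

Note that these satisfy the relation

$$
C_{m+1,i} - D_{m+1,i} = C_{m,i+1} - D_{m,i}
\tag{3.2.7}
$$

which is useful in proving the recurrences

$$
\begin{aligned}
D_{m+1,i} &= \frac{C_{m,i+1}\,(C_{m,i+1} - D_{m,i})}
 {\left(\dfrac{x - x_i}{x - x_{i+m+1}}\right) D_{m,i} - C_{m,i+1}} \\
C_{m+1,i} &= \frac{\left(\dfrac{x - x_i}{x - x_{i+m+1}}\right) D_{m,i}\,(C_{m,i+1} - D_{m,i})}
 {\left(\dfrac{x - x_i}{x - x_{i+m+1}}\right) D_{m,i} - C_{m,i+1}}
\end{aligned}
\tag{3.2.8}
$$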


This recurrence is implemented in the following function, whose use is analogous 
in every way to polint in §3.1. Note again that unit-offset input arrays are 
assumed (§1.2). 


#include <math.h>
#include "nrutil.h"
#define TINY 1.0e-25    /* A small number. */
#define FREERETURN {free_vector(d,1,n);free_vector(c,1,n);return;}

void ratint(float xa[], float ya[], int n, float x, float *y, float *dy)
/* Given arrays xa[1..n] and ya[1..n], and given a value of x, this routine returns a value of
y and an accuracy estimate dy. The value returned is that of the diagonal rational function,
evaluated at x, which passes through the n points (xa_i, ya_i), i = 1...n. */
{
    int m,i,ns=1;
    float w,t,hh,h,dd,*c,*d;

    c=vector(1,n);
    d=vector(1,n);
    hh=fabs(x-xa[1]);
    for (i=1;i<=n;i++) {
        h=fabs(x-xa[i]);
        if (h == 0.0) {
            *y=ya[i];
            *dy=0.0;
            FREERETURN
        } else if (h < hh) {
            ns=i;
            hh=h;
        }
        c[i]=ya[i];
        d[i]=ya[i]+TINY;    /* The TINY part is needed to prevent a rare zero-over-zero condition. */
    }
    *y=ya[ns--];
    for (m=1;m<n;m++) {
        for (i=1;i<=n-m;i++) {
            w=c[i+1]-d[i];
            h=xa[i+m]-x;    /* h will never be zero, since this was tested in the initializing loop. */
            t=(xa[i]-x)*d[i]/h;
            dd=t-c[i+1];
            if (dd == 0.0) nrerror("Error in routine ratint");
            /* This error condition indicates that the interpolating function has a pole at the
            requested value of x. */
            dd=w/dd;
            d[i]=c[i+1]*dd;
            c[i]=t*dd;
        }
        *y += (*dy=(2*ns < (n-m) ? c[ns+1] : d[ns--]));
    }
    FREERETURN
}
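
For readers working with zero-offset arrays, the same recurrences can be sketched as below. The name rat_interp, the macro RAT_TINY, the integer return codes (in place of nrerror), and the use of double are assumptions of this illustration rather than part of the routine above.

```c
#include <math.h>
#include <stdlib.h>

#define RAT_TINY 1.0e-25   /* guards against a rare zero-over-zero */

/* Diagonal rational interpolation by the recurrences (3.2.8), zero-offset.
   Returns 0 on success, -1 on allocation failure, -2 if the interpolating
   function has a pole at the requested x. */
int rat_interp(const double xa[], const double ya[], int n, double x,
               double *y, double *dy)
{
    int i, m, ns = 0, status = 0;
    double w, t, hh, h, dd;
    double *c = malloc((size_t)n * sizeof *c);
    double *d = malloc((size_t)n * sizeof *d);

    if (c == NULL || d == NULL) { free(c); free(d); return -1; }
    *dy = 0.0;
    hh = fabs(x - xa[0]);
    for (i = 0; i < n; i++) {
        h = fabs(x - xa[i]);
        if (h == 0.0) {                /* x coincides with a table entry */
            *y = ya[i];
            free(c);
            free(d);
            return 0;
        } else if (h < hh) {
            ns = i;
            hh = h;
        }
        c[i] = ya[i];
        d[i] = ya[i] + RAT_TINY;
    }
    *y = ya[ns--];
    for (m = 1; m < n && status == 0; m++) {
        for (i = 0; i < n - m; i++) {
            w = c[i + 1] - d[i];
            h = xa[i + m] - x;         /* nonzero: tested in the loop above */
            t = (xa[i] - x) * d[i] / h;
            dd = t - c[i + 1];
            if (dd == 0.0) { status = -2; break; }
            dd = w / dd;
            d[i] = c[i + 1] * dd;
            c[i] = t * dd;
        }
        if (status == 0)
            *y += (*dy = (2 * (ns + 1) < (n - m) ? c[ns + 1] : d[ns--]));
    }
    free(c);
    free(d);
    return status;
}
```

As a check, the function $1/(1 + x)$ is itself a rational function of the form used here, so tabulating it at $x = 1, 2, 3$ and interpolating at $x = 1.4$ should recover $1/2.4$ to within roundoff, with a final correction dy that is essentially zero.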


CITED REFERENCES AND FURTHER READING: 

Stoer, J., and Bulirsch, R. 1980, Introduction to Numerical Analysis (New York: Springer-Verlag),
§2.2. [1]

Gear, C.W. 1971, Numerical Initial Value Problems in Ordinary Differential Equations (Englewood
Cliffs, NJ: Prentice-Hall), §6.2.

Cuyt, A., and Wuytack, L. 1987, Nonlinear Methods in Numerical Analysis (Amsterdam: North-Holland),
Chapter 3.


3.3 Cubic Spline Interpolation 

Given a tabulated function $y_i = y(x_i)$, $i = 1 \ldots N$, focus attention on one
particular interval, between $x_j$ and $x_{j+1}$. Linear interpolation in that interval gives
the interpolation formula

$$
y = A y_j + B y_{j+1}
\tag{3.3.1}
$$

where

$$
A \equiv \frac{x_{j+1} - x}{x_{j+1} - x_j} \qquad
B \equiv 1 - A = \frac{x - x_j}{x_{j+1} - x_j}
\tag{3.3.2}
$$

Equations (3.3.1) and (3.3.2) are a special case of the general Lagrange interpolation 
formula (3.1.1). 

Since it is (piecewise) linear, equation (3.3.1) has zero second derivative in
the interior of each interval, and an undefined, or infinite, second derivative at the
abscissas $x_j$. The goal of cubic spline interpolation is to get an interpolation formula
that is smooth in the first derivative, and continuous in the second derivative, both
within an interval and at its boundaries.

Suppose, contrary to fact, that in addition to the tabulated values of $y_i$, we
also have tabulated values for the function’s second derivatives, $y''$, that is, a set
of numbers $y_i''$. Then, within each interval, we can add to the right-hand side of
equation (3.3.1) a cubic polynomial whose second derivative varies linearly from a
value $y_j''$ on the left to a value $y_{j+1}''$ on the right. Doing so, we will have the desired
continuous second derivative. If we also construct the cubic polynomial to have
zero values at $x_j$ and $x_{j+1}$, then adding it in will not spoil the agreement with the
tabulated functional values $y_j$ and $y_{j+1}$ at the endpoints $x_j$ and $x_{j+1}$.

A little side calculation shows that there is only one way to arrange this 
construction, namely replacing (3.3.1) by 

$$
y = A y_j + B y_{j+1} + C y_j'' + D y_{j+1}''
\tag{3.3.3}
$$

where $A$ and $B$ are defined in (3.3.2) and

$$
C \equiv \frac{1}{6}(A^3 - A)(x_{j+1} - x_j)^2 \qquad
D \equiv \frac{1}{6}(B^3 - B)(x_{j+1} - x_j)^2
\tag{3.3.4}
$$

Notice that the dependence on the independent variable $x$ in equations (3.3.3) and
(3.3.4) is entirely through the linear $x$-dependence of $A$ and $B$, and (through $A$ and
$B$) the cubic $x$-dependence of $C$ and $D$.

We can readily check that $y''$ is in fact the second derivative of the new
interpolating polynomial. We take derivatives of equation (3.3.3) with respect to $x$,
using the definitions of $A$, $B$, $C$, $D$ to compute $dA/dx$, $dB/dx$, $dC/dx$, and $dD/dx$.
The result is

$$
\frac{dy}{dx} = \frac{y_{j+1} - y_j}{x_{j+1} - x_j}
 - \frac{3A^2 - 1}{6}(x_{j+1} - x_j)\, y_j''
 + \frac{3B^2 - 1}{6}(x_{j+1} - x_j)\, y_{j+1}''
\tag{3.3.5}
$$

for the first derivative, and

$$
\frac{d^2 y}{dx^2} = A y_j'' + B y_{j+1}''
\tag{3.3.6}
$$

for the second derivative. Since $A = 1$ at $x_j$, $A = 0$ at $x_{j+1}$, while $B$ is just the
other way around, (3.3.6) shows that $y''$ is just the tabulated second derivative, and
also that the second derivative will be continuous across (e.g.) the boundary between
the two intervals $(x_{j-1}, x_j)$ and $(x_j, x_{j+1})$.
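
The intermediate derivatives used in this check can be written out explicitly. The shorthand $h \equiv x_{j+1} - x_j$ is introduced here only for compactness and is not notation used elsewhere in this section.

```latex
\begin{aligned}
\frac{dA}{dx} &= -\frac{1}{h}, &
\frac{dB}{dx} &= \frac{1}{h}, \\
\frac{dC}{dx} &= \frac{1}{6}(3A^2 - 1)\,\frac{dA}{dx}\,h^2 = -\frac{3A^2 - 1}{6}\,h, &
\frac{dD}{dx} &= \frac{1}{6}(3B^2 - 1)\,\frac{dB}{dx}\,h^2 = \frac{3B^2 - 1}{6}\,h, \\
\frac{d^2C}{dx^2} &= -A\,\frac{dA}{dx}\,h = A, &
\frac{d^2D}{dx^2} &= B\,\frac{dB}{dx}\,h = B.
\end{aligned}
```

Substituting the first two rows into the derivative of (3.3.3) gives (3.3.5). Because $A$ and $B$ are linear in $x$, their second derivatives vanish, so only the last row contributes to (3.3.6).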
The only problem now is that we supposed the $y_i''$’s to be known, when, actually,
they are not. However, we have not yet required that the first derivative, computed
from equation (3.3.5), be continuous across the boundary between two intervals. The
key idea of a cubic spline is to require this continuity and to use it to get equations
for the second derivatives $y_i''$.

The required equations are obtained by setting equation (3.3.5) evaluated for
$x = x_j$ in the interval $(x_{j-1}, x_j)$ equal to the same equation evaluated for $x = x_j$ but
in the interval $(x_j, x_{j+1})$. With some rearrangement, this gives (for $j = 2, \ldots, N - 1$)

$$
\frac{x_j - x_{j-1}}{6}\, y_{j-1}'' + \frac{x_{j+1} - x_{j-1}}{3}\, y_j'' + \frac{x_{j+1} - x_j}{6}\, y_{j+1}''
 = \frac{y_{j+1} - y_j}{x_{j+1} - x_j} - \frac{y_j - y_{j-1}}{x_j - x_{j-1}}
\tag{3.3.7}
$$

These are $N - 2$ linear equations in the $N$ unknowns $y_i''$, $i = 1, \ldots, N$. Therefore
there is a two-parameter family of possible solutions.

For a unique solution, we need to specify two further conditions, typically taken
as boundary conditions at $x_1$ and $x_N$. The most common ways of doing this are either

• set one or both of $y_1''$ and $y_N''$ equal to zero, giving the so-called natural
cubic spline, which has zero second derivative on one or both of its
boundaries, or

• set either of $y_1''$ and $y_N''$ to values calculated from equation (3.3.5) so as
to make the first derivative of the interpolating function have a specified
value on either or both boundaries.

One reason that cubic splines are especially practical is that the set of equations 
(3.3.7), along with the two additional boundary conditions, are not only linear, but 
also tridiagonal. Each $y_j''$ is coupled only to its nearest neighbors at $j \pm 1$. Therefore,
the equations can be solved in O(N) operations by the tridiagonal algorithm (§2.4). 
That algorithm is concise enough to build right into the spline calculational routine. 
This makes the routine not completely transparent as an implementation of (3.3.7), 
so we encourage you to study it carefully, comparing with tridag (§2.4). Arrays 
are assumed to be unit-offset. If you have zero-offset arrays, see §1.2. 

#include "nrutil.h"

void spline(float x[], float y[], int n, float yp1, float ypn, float y2[])
/* Given arrays x[1..n] and y[1..n] containing a tabulated function, i.e., y_i = f(x_i), with
x_1 < x_2 < ... < x_N, and given values yp1 and ypn for the first derivative of the interpolating
function at points 1 and n, respectively, this routine returns an array y2[1..n] that contains
the second derivatives of the interpolating function at the tabulated points x_i. If yp1 and/or
ypn are equal to 1 x 10^30 or larger, the routine is signaled to set the corresponding boundary
condition for a natural spline, with zero second derivative on that boundary. */
{
    int i,k;
    float p,qn,sig,un,*u;

    u=vector(1,n-1);
    if (yp1 > 0.99e30)      /* The lower boundary condition is set either to be "natural" */
        y2[1]=u[1]=0.0;
    else {                  /* or else to have a specified first derivative. */
        y2[1] = -0.5;
        u[1]=(3.0/(x[2]-x[1]))*((y[2]-y[1])/(x[2]-x[1])-yp1);
    }
    for (i=2;i<=n-1;i++) {  /* This is the decomposition loop of the tridiagonal algorithm.
                               y2 and u are used for temporary storage of the decomposed
                               factors. */
        sig=(x[i]-x[i-1])/(x[i+1]-x[i-1]);
        p=sig*y2[i-1]+2.0;
        y2[i]=(sig-1.0)/p;
        u[i]=(y[i+1]-y[i])/(x[i+1]-x[i]) - (y[i]-y[i-1])/(x[i]-x[i-1]);
        u[i]=(6.0*u[i]/(x[i+1]-x[i-1])-sig*u[i-1])/p;
    }
    if (ypn > 0.99e30)      /* The upper boundary condition is set either to be "natural" */
        qn=un=0.0;
    else {                  /* or else to have a specified first derivative. */
        qn=0.5;
        un=(3.0/(x[n]-x[n-1]))*(ypn-(y[n]-y[n-1])/(x[n]-x[n-1]));
    }
    y2[n]=(un-qn*u[n-1])/(qn*y2[n-1]+1.0);
    for (k=n-1;k>=1;k--)    /* This is the backsubstitution loop of the tridiagonal algorithm. */
        y2[k]=y2[k]*y2[k+1]+u[k];
    free_vector(u,1,n-1);
}


It is important to understand that the program spline is called only once to
process an entire tabulated function in arrays $x_i$ and $y_i$. Once this has been done,
values of the interpolated function for any value of x are obtained by calls (as many 
as desired) to a separate routine splint (for “spline interpolation”): 

void splint(float xa[], float ya[], float y2a[], int n, float x, float *y)

Given the arrays xa[1..n] and ya[1..n], which tabulate a function (with the $xa_i$'s in order),
and given the array y2a[1..n], which is the output from spline above, and given a value of
x, this routine returns a cubic-spline interpolated value y.

{
    void nrerror(char error_text[]);
    int klo,khi,k;
    float h,b,a;

    /* We will find the right place in the table by means of bisection.
       This is optimal if sequential calls to this routine are at random
       values of x. If sequential calls are in order, and closely spaced,
       one would do better to store previous values of klo and khi and
       test if they remain appropriate on the next call. */
    klo=1;
    khi=n;
    while (khi-klo > 1) {
        k=(khi+klo) >> 1;
        if (xa[k] > x) khi=k;
        else klo=k;
    }           /* klo and khi now bracket the input value of x. */
    h=xa[khi]-xa[klo];
    if (h == 0.0) nrerror("Bad xa input to routine splint");   /* The xa's must be distinct. */
    a=(xa[khi]-x)/h;
    b=(x-xa[klo])/h;    /* Cubic spline polynomial is now evaluated. */
    *y=a*ya[klo]+b*ya[khi]+((a*a*a-a)*y2a[klo]+(b*b*b-b)*y2a[khi])*(h*h)/6.0;
}
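The two-step pattern described above (construct once, evaluate many times) can be sketched in a self-contained form. The following is only an illustration, not the book's routines: it assumes zero-offset arrays, double precision, natural boundary conditions at both ends, and at least two tabulated points; the names spline_natural0 and splint0 are hypothetical.

```c
#include <stdlib.h>

/* Hypothetical zero-offset variant of spline(), natural boundaries only.
   Fills y2[0..n-1] with second derivatives; returns 0 on success,
   -1 if the scratch array cannot be allocated. Assumes n >= 2. */
int spline_natural0(const double x[], const double y[], int n, double y2[])
{
    double *u = malloc((size_t)n * sizeof *u);
    if (u == NULL) return -1;
    y2[0] = u[0] = 0.0;                      /* natural lower boundary */
    for (int i = 1; i < n - 1; i++) {        /* tridiagonal decomposition */
        double sig = (x[i] - x[i-1]) / (x[i+1] - x[i-1]);
        double p = sig * y2[i-1] + 2.0;
        y2[i] = (sig - 1.0) / p;
        u[i] = (y[i+1] - y[i]) / (x[i+1] - x[i])
             - (y[i] - y[i-1]) / (x[i] - x[i-1]);
        u[i] = (6.0 * u[i] / (x[i+1] - x[i-1]) - sig * u[i-1]) / p;
    }
    y2[n-1] = 0.0;                           /* natural upper boundary */
    for (int k = n - 2; k >= 0; k--)         /* backsubstitution */
        y2[k] = y2[k] * y2[k+1] + u[k];
    free(u);
    return 0;
}

/* Hypothetical zero-offset variant of splint(). The xa's are assumed
   distinct and increasing; no error check is made here. */
double splint0(const double xa[], const double ya[], const double y2a[],
               int n, double x)
{
    int klo = 0, khi = n - 1;
    while (khi - klo > 1) {                  /* bisection for the bracket */
        int k = (khi + klo) / 2;
        if (xa[k] > x) khi = k;
        else klo = k;
    }
    double h = xa[khi] - xa[klo];
    double a = (xa[khi] - x) / h;
    double b = (x - xa[klo]) / h;
    return a * ya[klo] + b * ya[khi]
         + ((a*a*a - a) * y2a[klo] + (b*b*b - b) * y2a[khi]) * (h*h) / 6.0;
}
```

A caller would invoke spline_natural0 once per table and then splint0 as often as needed. For data that lie on a straight line the computed second derivatives vanish, so the spline reduces to linear interpolation, which makes a convenient sanity check.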


CITED REFERENCES AND FURTHER READING: 

De Boor, C. 1978, A Practical Guide to Splines (New York: Springer-Verlag). 

Forsythe, G.E., Malcolm, M.A., and Moler, C.B. 1977, Computer Methods for Mathematical 
Computations (Englewood Cliffs, NJ: Prentice-Hall), §§4.4-4.5. 

Stoer, J., and Bulirsch, R. 1980, Introduction to Numerical Analysis (New York: Springer-Verlag), 
§2.4. 

Ralston, A., and Rabinowitz, P. 1978, A First Course in Numerical Analysis, 2nd ed. (New York:
McGraw-Hill), §3.8.


3.4 How to Search an Ordered Table


Suppose that you have decided to use some particular interpolation scheme, 
such as fourth-order polynomial interpolation, to compute a function f(x) from a
set of tabulated $x_i$'s and $f_i$'s. Then you will need a fast way of finding your place
in the table of $x_i$'s, given some particular value x at which the function evaluation
is desired. This problem is not properly one of numerical analysis, but it occurs so 
often in practice that it would be negligent of us to ignore it. 

Formally, the problem is this: Given an array of abscissas xx[j], j = 1, 2, ..., n,
with the elements either monotonically increasing or monotonically decreasing, and
given a number x, find an integer j such that x lies between xx[j] and xx[j+1].
For this task, let us define fictitious array elements xx[0] and xx[n+1] equal to
plus or minus infinity (in whichever order is consistent with the monotonicity of the 
table). Then j will always be between 0 and n, inclusive; a value of 0 indicates 
“off-scale” at one end of the table, n indicates off-scale at the other end. 

In most cases, when all is said and done, it is hard to do better than bisection,
which will find the right place in the table in about $\log_2 n$ tries. We already used
bisection in the spline evaluation routine splint of the preceding section, so you
might glance back at that. Standing by itself, a bisection routine looks like this:


void locate(float xx[], unsigned long n, float x, unsigned long *j)

Given an array xx[1..n], and given a value x, returns a value j such that x is between xx[j]
and xx[j+1]. xx must be monotonic, either increasing or decreasing. j=0 or j=n is returned
to indicate that x is out of range.

{
    unsigned long ju,jm,jl;
    int ascnd;

    jl=0;                           /* Initialize lower */
    ju=n+1;                         /* and upper limits. */
    ascnd=(xx[n] >= xx[1]);
    while (ju-jl > 1) {             /* If we are not yet done, */
        jm=(ju+jl) >> 1;            /* compute a midpoint, */
        if (x >= xx[jm] == ascnd)
            jl=jm;                  /* and replace either the lower limit */
        else
            ju=jm;                  /* or the upper limit, as appropriate. */
    }                               /* Repeat until the test condition is satisfied. */
    if (x == xx[1]) *j=1;           /* Then set the output */
    else if(x == xx[n]) *j=n-1;
    else *j=jl;
}                                   /* and return. */



A unit-offset array xx is assumed. To use locate with a zero-offset array,
remember to subtract 1 from the address of xx, and also from the returned value j.
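As an alternative to adjusting the address, the same bisection can be written directly for a zero-offset table. The sketch below is an illustration under stated assumptions rather than the book's routine: the name locate0 is hypothetical, the index is returned as a signed long so that -1 can mark "off-scale" at the low end (n-1 marks the other end), and n is assumed to be at least 1.

```c
/* Hypothetical zero-offset bisection search.
   Returns j such that x lies between xx[j] and xx[j+1];
   -1 or n-1 indicates that x is out of range. */
long locate0(const float xx[], long n, float x)
{
    long jl = -1;                         /* fictitious element below the table */
    long ju = n;                          /* fictitious element above the table */
    int ascnd = (xx[n-1] >= xx[0]);       /* nonzero if the table is ascending */

    while (ju - jl > 1) {
        long jm = jl + (ju - jl) / 2;     /* midpoint, always a valid index */
        if ((x >= xx[jm]) == ascnd)
            jl = jm;
        else
            ju = jm;
    }
    if (x == xx[0]) return 0;             /* endpoints count as in range */
    if (x == xx[n-1]) return n - 2;
    return jl;
}
```

Using a signed index avoids the pointer adjustment xx-1, at the cost of a return convention that differs from locate by one.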

Search with Correlated Values 


Sometimes you will be in the situation of searching a large table many times,
and with nearly identical abscissas on consecutive searches. For example, you
may be generating a function that is used on the right-hand side of a differential
equation: Most differential-equation integrators, as we shall see in Chapter 16, call
for right-hand side evaluations at points that hop back and forth a bit, but whose
trend moves slowly in the direction of the integration.

Figure 3.4.1. (a) The routine locate finds a table entry by bisection. Shown here is the sequence
of steps that converge to element 51 in a table of length 64. (b) The routine hunt searches from a
previous known position in the table by increasing steps, then converges by bisection. Shown here is a
particularly unfavorable example, converging to element 32 from element 7. A favorable example would
be convergence to an element near 7, such as 9, which would require just three “hops.”

In such cases it is wasteful to do a full bisection, ab initio, on each call. The 
following routine instead starts with a guessed position in the table. It first “hunts,” 
either up or down, in increments of 1, then 2, then 4, etc., until the desired value is 
bracketed. Second, it then bisects in the bracketed interval. At worst, this routine is 
about a factor of 2 slower than locate above (if the hunt phase expands to include 
the whole table). At best, it can be a factor of $\log_2 n$ faster than locate, if the desired
point is usually quite close to the input guess. Figure 3.4.1 compares the two routines. 


void hunt(float xx[], unsigned long n, float x, unsigned long *jlo)

Given an array xx[1..n], and given a value x, returns a value jlo such that x is between
xx[jlo] and xx[jlo+1]. xx[1..n] must be monotonic, either increasing or decreasing.
jlo=0 or jlo=n is returned to indicate that x is out of range. jlo on input is taken as the
initial guess for jlo on output.

{
    unsigned long jm,jhi,inc;
    int ascnd;

    ascnd=(xx[n] >= xx[1]);             /* True if ascending order of table, false otherwise. */
    if (*jlo <= 0 || *jlo > n) {        /* Input guess not useful. Go immediately to bisection. */
        *jlo=0;
        jhi=n+1;
    } else {
        inc=1;                          /* Set the hunting increment. */
        if (x >= xx[*jlo] == ascnd) {   /* Hunt up: */
            if (*jlo == n) return;
            jhi=(*jlo)+1;
            while (x >= xx[jhi] == ascnd) {     /* Not done hunting, */
                *jlo=jhi;
                inc += inc;             /* so double the increment */
                jhi=(*jlo)+inc;
                if (jhi > n) {          /* Done hunting, since off end of table. */
                    jhi=n+1;
                    break;
                }                       /* Try again. */
            }                           /* Done hunting, value bracketed. */
        } else {                        /* Hunt down: */
            if (*jlo == 1) {
                *jlo=0;
                return;
            }
            jhi=(*jlo)--;
            while (x < xx[*jlo] == ascnd) {     /* Not done hunting, */
                jhi=(*jlo);
                inc <<= 1;              /* so double the increment */
                if (inc >= jhi) {       /* Done hunting, since off end of table. */
                    *jlo=0;
                    break;
                }
                else *jlo=jhi-inc;
            }                           /* and try again. */
        }                               /* Done hunting, value bracketed. */
    }
    /* Hunt is done, so begin the final bisection phase: */
    while (jhi-(*jlo) != 1) {
        jm=(jhi+(*jlo)) >> 1;
        if (x >= xx[jm] == ascnd)
            *jlo=jm;
        else
            jhi=jm;
    }
    if (x == xx[n]) *jlo=n-1;
    if (x == xx[1]) *jlo=1;
}

If your array xx is zero-offset, read the comment following locate, above. 
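For readers who work with zero-offset tables throughout, the hunt-then-bisect idea can also be sketched natively. This is a restructured illustration, not a transcription of hunt: the name hunt0 is hypothetical, indices are signed so that -1 can mean "off-scale low" (n-1 means off-scale high), the guess is passed and returned by value, and the bounds checks are folded into the loop conditions.

```c
/* Hypothetical zero-offset hunt. jlo is the initial guess; the return value
   j satisfies "x lies between xx[j] and xx[j+1]", with -1 or n-1 meaning
   that x is out of range. Assumes n >= 1 and a monotonic table. */
long hunt0(const float xx[], long n, float x, long jlo)
{
    long jhi, inc = 1;
    int ascnd = (xx[n-1] >= xx[0]);

    if (jlo < 0 || jlo > n - 1) {              /* guess not useful: bisect all */
        jlo = -1;
        jhi = n;
    } else if ((x >= xx[jlo]) == ascnd) {      /* hunt up */
        jhi = jlo + 1;
        while (jhi < n && (x >= xx[jhi]) == ascnd) {
            jlo = jhi;
            inc += inc;                        /* double the increment */
            jhi = jlo + inc;
        }
        if (jhi > n) jhi = n;                  /* ran off the upper end */
    } else {                                   /* hunt down */
        jhi = jlo;
        jlo = jhi - 1;
        while (jlo >= 0 && (x < xx[jlo]) == ascnd) {
            jhi = jlo;
            inc += inc;
            jlo = jhi - inc;
        }
        if (jlo < -1) jlo = -1;                /* ran off the lower end */
    }
    while (jhi - jlo > 1) {                    /* final bisection phase */
        long jm = jlo + (jhi - jlo) / 2;
        if ((x >= xx[jm]) == ascnd) jlo = jm;
        else jhi = jm;
    }
    if (x == xx[n-1]) jlo = n - 2;
    if (x == xx[0]) jlo = 0;
    return jlo;
}
```

In a loop over slowly drifting abscissas one would feed each returned index back in as the next guess, for example j = hunt0(xx, n, x, j).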

After the Hunt 

The problem: Routines locate and hunt return an index j such that your
desired value lies between table entries xx[j] and xx[j+1], where xx[1..n] is the
full length of the table. But, to obtain an m-point interpolated value using a routine 
like polint (§3.1) or ratint (§3.2), you need to supply much shorter xx and yy 
arrays, of length m. How do you make the connection? 

The solution: Calculate 

k = IMIN(IMAX(j-(m-1)/2,1),n+1-m)

(The macros IMIN and IMAX give the minimum and maximum of two integer 
arguments; see §1.2 and Appendix B.) This expression produces the index of the 
leftmost member of an m-point set of points centered (insofar as possible) between 
j and j+1, but bounded by 1 at the left and n at the right. C then lets you call the 
interpolation routine with array addresses offset by k, e.g., 

polint(&xx[k-1],&yy[k-1],m,...)
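The clamping expression is easy to check by hand for a few cases. The following sketch uses plain functions in place of the IMIN and IMAX macros; the names imin, imax, and stencil_start are hypothetical, and signed int arguments are assumed (the index returned by locate or hunt is unsigned, so a caller would need to convert it before subtracting).

```c
/* Stand-ins for the IMAX and IMIN macros of nrutil.h. */
static int imax(int a, int b) { return a > b ? a : b; }
static int imin(int a, int b) { return a < b ? a : b; }

/* Unit-offset index of the leftmost point of an m-point stencil centered,
   as far as possible, between table entries j and j+1 of a table xx[1..n]. */
int stencil_start(int j, int m, int n)
{
    return imin(imax(j - (m - 1) / 2, 1), n + 1 - m);
}
```

With n = 10 and m = 4, j = 5 gives k = 4 (points 4 through 7), while the off-scale values j = 0 and j = 10 are pulled back to k = 1 and k = 7, respectively.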



CITED REFERENCES AND FURTHER READING: 

Knuth, D.E. 1973, Sorting and Searching, vol. 3 of The Art of Computer Programming (Reading,
MA: Addison-Wesley), §6.2.1.


3.5 Coefficients of the Interpolating Polynomial

Occasionally you may wish to know not the value of the interpolating polynomial 
that passes through a (small!) number of points, but the coefficients of that
polynomial. A valid use of the coefficients might be, for example, to compute
simultaneous interpolated values of the function and of several of its derivatives (see 
§5.3), or to convolve a segment of the tabulated function with some other function, 
where the moments of that other function (i.e., its convolution with powers of x) 
are known analytically. 

However, please be certain that the coefficients are what you need. Generally the 
coefficients of the interpolating polynomial can be determined much less accurately 
than its value at a desired abscissa. Therefore it is not a good idea to determine the 
coefficients only for use in calculating interpolating values. Values thus calculated 
will not pass exactly through the tabulated points, for example, while values computed 
by the routines in §3.1-§3.3 will pass exactly through such points.

Also, you should not mistake the interpolating polynomial (and its coefficients) 
for its cousin, the best fit polynomial through a data set. Fitting is a smoothing 
process, since the number of fitted coefficients is typically much less than the 
number of data points. Therefore, fitted coefficients can be accurately and stably 
determined even in the presence of statistical errors in the tabulated values. (See 
§14.8.) Interpolation, where the number of coefficients and number of tabulated 
points are equal, takes the tabulated values as perfect. If they in fact contain statistical 
errors, these can be magnified into oscillations of the interpolating polynomial in 
between the tabulated points. 

As before, we take the tabulated points to be $y_i \equiv y(x_i)$. If the interpolating
polynomial is written as

$$y = c_0 + c_1 x + c_2 x^2 + \cdots + c_N x^N \tag{3.5.1}$$

then the $c_i$'s are required to satisfy the linear equation

$$
\begin{bmatrix}
1 & x_0 & x_0^2 & \cdots & x_0^N \\
1 & x_1 & x_1^2 & \cdots & x_1^N \\
\vdots & \vdots & \vdots & & \vdots \\
1 & x_N & x_N^2 & \cdots & x_N^N
\end{bmatrix}
\cdot
\begin{bmatrix} c_0 \\ c_1 \\ \vdots \\ c_N \end{bmatrix}
=
\begin{bmatrix} y_0 \\ y_1 \\ \vdots \\ y_N \end{bmatrix}
\tag{3.5.2}
$$

This is a Vandermonde matrix, as described in §2.8. One could in principle solve
equation (3.5.2) by standard techniques for linear equations generally (§2.3); however
the special method that was derived in §2.8 is more efficient by a large factor, of
order N, so it is much better.

Remember that Vandermonde systems can be quite ill-conditioned. In such a
case, no numerical method is going to give a very accurate answer. Such cases do
not, please note, imply any difficulty in finding interpolated values by the methods
of §3.1, but only difficulty in finding coefficients.

Like the routine in §2.8, the following is due to G.B. Rybicki. Note that the
arrays are all assumed to be zero-offset.

#include "nrutil.h"

void polcoe(float x[], float y[], int n, float cof[])

Given arrays x[0..n] and y[0..n] containing a tabulated function $y_i = f(x_i)$, this routine
returns an array of coefficients cof[0..n], such that $y_i = \sum_j \mathrm{cof}_j x_i^j$.

{
    int k,j,i;
    float phi,ff,b,*s;

    s=vector(0,n);
    for (i=0;i<=n;i++) s[i]=cof[i]=0.0;
    s[n] = -x[0];
    /* Coefficients s_i of the master polynomial P(x) are found by recurrence. */
    for (i=1;i<=n;i++) {
        for (j=n-i;j<=n-1;j++)
            s[j] -= x[i]*s[j+1];
        s[n] -= x[i];
    }
    for (j=0;j<=n;j++) {
        /* The quantity phi = prod_{j != k} (x_j - x_k) is found as a
           derivative of P(x_j). */
        phi=n+1;
        for (k=n;k>=1;k--)
            phi=k*s[k]+x[j]*phi;
        ff=y[j]/phi;
        /* Coefficients of polynomials in each term of the Lagrange formula
           are found by synthetic division of P(x) by (x - x_j). The
           solution c_k is accumulated. */
        b=1.0;
        for (k=n;k>=0;k--) {
            cof[k] += b*ff;
            b=s[k]+x[j]*b;
        }
    }
    free_vector(s,0,n);
}

Another Method 

Another technique is to make use of the function value interpolation routine
already given (polint §3.1). If we interpolate (or extrapolate) to find the value of
the interpolating polynomial at x = 0, then this value will evidently be $c_0$. Now
we can subtract $c_0$ from the $y_i$'s and divide each by its corresponding $x_i$. Throwing
out one point (the one with smallest $x_i$ is a good candidate), we can repeat the
procedure to find $c_1$, and so on.

It is not instantly obvious that this procedure is stable, but we have generally
found it to be somewhat more stable than the routine immediately preceding. This
method is of order $N^3$, while the preceding one was of order $N^2$. You will
find, however, that neither works very well for large N, because of the intrinsic
ill-condition of the Vandermonde problem. In single precision, N up to 8 or 10 is
satisfactory; about double this in double precision.

#include <math.h>
#include "nrutil.h"

void polcof(float xa[], float ya[], int n, float cof[])

Given arrays xa[0..n] and ya[0..n] containing a tabulated function $ya_i = f(xa_i)$, this
routine returns an array of coefficients cof[0..n] such that $ya_i = \sum_j \mathrm{cof}_j\, xa_i^j$.

{
    void polint(float xa[], float ya[], int n, float x, float *y, float *dy);
    int k,j,i;
    float xmin,dy,*x,*y;

    x=vector(0,n);
    y=vector(0,n);
    for (j=0;j<=n;j++) {
        x[j]=xa[j];
        y[j]=ya[j];
    }
    for (j=0;j<=n;j++) {
        /* Subtract 1 from the pointers to x and y because polint uses
           dimensions [1..n]. We extrapolate to x = 0. */
        polint(x-1,y-1,n+1-j,0.0,&cof[j],&dy);
        xmin=1.0e38;
        k = -1;
        for (i=0;i<=n-j;i++) {          /* Find the remaining x_i of smallest */
            if (fabs(x[i]) < xmin) {    /* absolute value, */
                xmin=fabs(x[i]);
                k=i;
            }
            if (x[i]) y[i]=(y[i]-cof[j])/x[i];  /* (meanwhile reducing all the terms) */
        }
        for (i=k+1;i<=n-j;i++) {        /* and eliminate it. */
            y[i-1]=y[i];
            x[i-1]=x[i];
        }
    }
    free_vector(y,0,n);
    free_vector(x,0,n);
}



If the point x = 0 is not in (or at least close to) the range of the tabulated $x_i$'s,
then the coefficients of the interpolating polynomial will in general become very large. 
However, the real “information content” of the coefficients is in small differences 
from the “translation-induced” large values. This is one cause of ill-conditioning, 
resulting in loss of significance and poorly determined coefficients. You should 
consider redefining the origin of the problem, to put x = 0 in a sensible place. 

Another pathology is that, if too high a degree of interpolation is attempted on 
a smooth function, the interpolating polynomial will attempt to use its high-degree 
coefficients, in combinations with large and almost precisely canceling combinations, 
to match the tabulated values down to the last possible epsilon of accuracy. This 
effect is the same as the intrinsic tendency of the interpolating polynomial values to 
oscillate (wildly) between its constrained points, and would be present even if the 
machine’s floating precision were infinitely good. The above routines polcoe and 
polcof have slightly different sensitivities to the pathologies that can occur. 

Are you still quite certain that using the coefficients is a good idea? 

CITED REFERENCES AND FURTHER READING: 

Isaacson, E., and Keller, H.B. 1966, Analysis of Numerical Methods (New York: Wiley), §5.2.


3.6 Interpolation in Two or More Dimensions

In multidimensional interpolation, we seek an estimate of $y(x_1, x_2, \ldots, x_n)$
from an n-dimensional grid of tabulated values y and n one-dimensional vectors
giving the tabulated values of each of the independent variables $x_1, x_2, \ldots, x_n$.
We will not here consider the problem of interpolating on a mesh that is not
Cartesian, i.e., has tabulated function values at “random” points in n-dimensional 
space rather than at the vertices of a rectangular array. For clarity, we will consider 
explicitly only the case of two dimensions, the cases of three or more dimensions 
being analogous in every way. 

In two dimensions, we imagine that we are given a matrix of functional values
ya[1..m][1..n]. We are also given an array x1a[1..m], and an array x2a[1..n].
The relation of these input quantities to an underlying function $y(x_1, x_2)$ is

ya[j][k] = y(x1a[j], x2a[k])    (3.6.1)

We want to estimate, by interpolation, the function y at some untabulated point
$(x_1, x_2)$.

An important concept is that of the grid square in which the point $(x_1, x_2)$
falls, that is, the four tabulated points that surround the desired interior point. For
convenience, we will number these points from 1 to 4, counterclockwise starting
from the lower left (see Figure 3.6.1). More precisely, if

$$
\begin{aligned}
\texttt{x1a[j]} &\le x_1 \le \texttt{x1a[j+1]} \\
\texttt{x2a[k]} &\le x_2 \le \texttt{x2a[k+1]}
\end{aligned}
\tag{3.6.2}
$$

defines j and k, then

$$
\begin{aligned}
y_1 &\equiv \texttt{ya[j][k]} \\
y_2 &\equiv \texttt{ya[j+1][k]} \\
y_3 &\equiv \texttt{ya[j+1][k+1]} \\
y_4 &\equiv \texttt{ya[j][k+1]}
\end{aligned}
\tag{3.6.3}
$$

The simplest interpolation in two dimensions is bilinear interpolation on the 
grid square. Its formulas are: 

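$$
\begin{aligned}
t &\equiv (x_1 - \texttt{x1a[j]}) / (\texttt{x1a[j+1]} - \texttt{x1a[j]}) \\
u &\equiv (x_2 - \texttt{x2a[k]}) / (\texttt{x2a[k+1]} - \texttt{x2a[k]})
\end{aligned}
\tag{3.6.4}
$$

(so that t and u each lie between 0 and 1), and

$$y(x_1, x_2) = (1 - t)(1 - u)\,y_1 + t(1 - u)\,y_2 + t u\,y_3 + (1 - t)u\,y_4 \tag{3.6.5}$$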

Bilinear interpolation is frequently “close enough for government work.” As
the interpolating point wanders from grid square to grid square, the interpolated
function value changes continuously. However, the gradient of the interpolated
function changes discontinuously at the boundaries of each grid square.
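The bilinear formula on a single grid square is short enough to write out directly. In the sketch below the function name bilinear and its argument layout are hypothetical: the four corner values are passed in y[0..3], numbered counterclockwise from the lower left (so y[0] plays the role of $y_1$), and the square is given by its lower and upper coordinates in each direction.

```c
/* Bilinear interpolation on one grid square [x1l,x1u] x [x2l,x2u].
   y[0..3] holds the corner values counterclockwise from the lower left.
   The square is assumed to have nonzero extent in both directions. */
double bilinear(double x1l, double x1u, double x2l, double x2u,
                const double y[4], double x1, double x2)
{
    double t = (x1 - x1l) / (x1u - x1l);     /* scaled coordinate in direction 1 */
    double u = (x2 - x2l) / (x2u - x2l);     /* scaled coordinate in direction 2 */
    return (1.0 - t) * (1.0 - u) * y[0]
         + t * (1.0 - u) * y[1]
         + t * u * y[2]
         + (1.0 - t) * u * y[3];
}
```

Because the formula is linear in each coordinate separately, it reproduces any function of the form $a + b x_1 + c x_2 + d x_1 x_2$ exactly; for example, corner values 1, 2, 4, 3 on the unit square correspond to $1 + x_1 + 2x_2$.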

There are two distinctly different directions that one can take in going beyond 
bilinear interpolation to higher-order methods: One can use higher order to obtain 
increased accuracy for the interpolated function (for sufficiently smooth functions!), 
without necessarily trying to fix up the continuity of the gradient and higher 
derivatives. Or, one can make use of higher order to enforce smoothness of some of 
these derivatives as the interpolating point crosses grid-square boundaries. We will 
now consider each of these two directions in turn. 



Higher Order for Accuracy 

The basic idea is to break up the problem into a succession of one-dimensional
interpolations. If we want to do m-1 order interpolation in the $x_1$ direction, and n-1
order in the $x_2$ direction, we first locate an m × n sub-block of the tabulated function
matrix that contains our desired point $(x_1, x_2)$. We then do m one-dimensional
interpolations in the $x_2$ direction, i.e., on the rows of the sub-block, to get function
values at the points (x1a[j], $x_2$), j = 1, ..., m. Finally, we do a last interpolation
in the $x_1$ direction to get the answer. If we use the polynomial interpolation routine
polint of §3.1, and a sub-block which is presumed to be already located (and
addressed through the pointer float **ya, see §1.2), the procedure looks like this:

#include "nrutil.h"

void polin2(float x1a[], float x2a[], float **ya, int m, int n, float x1,
    float x2, float *y, float *dy)

Given arrays x1a[1..m] and x2a[1..n] of independent variables, and a submatrix of function
values ya[1..m][1..n], tabulated at the grid points defined by x1a and x2a; and given values
x1 and x2 of the independent variables; this routine returns an interpolated function value y,
and an accuracy indication dy (based only on the interpolation in the x1 direction, however).

{
    void polint(float xa[], float ya[], int n, float x, float *y, float *dy);
    int j;
    float *ymtmp;

    ymtmp=vector(1,m);
    for (j=1;j<=m;j++) {                        /* Loop over rows. */
        /* Interpolate answer into temporary storage. */
        polint(x2a,ya[j],n,x2,&ymtmp[j],dy);
    }
    polint(x1a,ymtmp,m,x1,y,dy);                /* Do the final interpolation. */
    free_vector(ymtmp,1,m);
}

Higher Order for Smoothness: Bicubic Interpolation 


We will give two methods that are in common use, and which are themselves 
not unrelated. The first is usually called bicubic interpolation. 

Bicubic interpolation requires the user to specify at each grid point not just
the function $y(x_1, x_2)$, but also the gradients $\partial y/\partial x_1 \equiv y_{,1}$, $\partial y/\partial x_2 \equiv y_{,2}$ and
the cross derivative $\partial^2 y/\partial x_1 \partial x_2 \equiv y_{,12}$. Then an interpolating function that is
cubic in the scaled coordinates t and u (equation 3.6.4) can be found, with the
following properties: (i) The values of the function and the specified derivatives
are reproduced exactly on the grid points, and (ii) the values of the function and
the specified derivatives change continuously as the interpolating point crosses from
one grid square to another.

It is important to understand that nothing in the equations of bicubic interpolation
requires you to specify the extra derivatives correctly! The smoothness properties are
tautologically “forced,” and have nothing to do with the “accuracy” of the specified 
derivatives. It is a separate problem for you to decide how to obtain the values that 
are specified. The better you do, the more accurate the interpolation will be. But 
it will be smooth no matter what you do. 

Best of all is to know the derivatives analytically, or to be able to compute them 
accurately by numerical means, at the grid points. Next best is to determine them by 
numerical differencing from the functional values already tabulated on the grid. The 
relevant code would be something like this (using centered differencing): 


y1a[j][k]=(ya[j+1][k]-ya[j-1][k])/(x1a[j+1]-x1a[j-1]);
y2a[j][k]=(ya[j][k+1]-ya[j][k-1])/(x2a[k+1]-x2a[k-1]);
y12a[j][k]=(ya[j+1][k+1]-ya[j+1][k-1]-ya[j-1][k+1]+ya[j-1][k-1])
    /((x1a[j+1]-x1a[j-1])*(x2a[k+1]-x2a[k-1]));
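Wrapped in loops, the centered differences above fill derivative tables for every interior grid point. The sketch below is one possible arrangement, with several assumptions that are not in the text: the name grid_derivs is hypothetical, arrays are zero-offset, and each m-by-n table is stored row-major in a flat array so that entry (j,k) is a[j*n + k]. Edge points are left untouched, since centered differences are not available there; a real application would need one-sided differences or other boundary information.

```c
/* Fill y1a, y2a, y12a at interior points (1 <= j <= m-2, 1 <= k <= n-2)
   of an m-by-n grid by centered differencing. All tables are flat,
   row-major, zero-offset arrays of length m*n. */
void grid_derivs(const double x1a[], const double x2a[], const double ya[],
                 int m, int n, double y1a[], double y2a[], double y12a[])
{
    for (int j = 1; j < m - 1; j++) {
        for (int k = 1; k < n - 1; k++) {
            double dx1 = x1a[j+1] - x1a[j-1];
            double dx2 = x2a[k+1] - x2a[k-1];
            y1a[j*n + k] = (ya[(j+1)*n + k] - ya[(j-1)*n + k]) / dx1;
            y2a[j*n + k] = (ya[j*n + k + 1] - ya[j*n + k - 1]) / dx2;
            y12a[j*n + k] = (ya[(j+1)*n + k + 1] - ya[(j+1)*n + k - 1]
                           - ya[(j-1)*n + k + 1] + ya[(j-1)*n + k - 1])
                          / (dx1 * dx2);
        }
    }
}
```

For the function $y = x_1 x_2$ tabulated on a uniform 3-by-3 grid with abscissas 0, 1, 2 in each direction, the center point should come out with both gradients and the cross derivative equal to 1.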


To do a bicubic interpolation within a grid square, given the function y and the
derivatives y1, y2, y12 at each of the four corners of the square, there are two steps:
First obtain the sixteen quantities $c_{ij}$, i, j = 1, ..., 4 using the routine bcucof
below. (The formulas that obtain the c’s from the function and derivative values
are just a complicated linear transformation, with coefficients which, having been
determined once in the mists of numerical history, can be tabulated and forgotten.)
Next, substitute the c’s into any or all of the following bicubic formulas for function
and derivatives, as desired:

$$
\begin{aligned}
y(x_1, x_2) &= \sum_{i=1}^{4} \sum_{j=1}^{4} c_{ij}\, t^{i-1} u^{j-1} \\
y_{,1}(x_1, x_2) &= \sum_{i=1}^{4} \sum_{j=1}^{4} (i-1)\, c_{ij}\, t^{i-2} u^{j-1}\, (dt/dx_1) \\
y_{,2}(x_1, x_2) &= \sum_{i=1}^{4} \sum_{j=1}^{4} (j-1)\, c_{ij}\, t^{i-1} u^{j-2}\, (du/dx_2) \\
y_{,12}(x_1, x_2) &= \sum_{i=1}^{4} \sum_{j=1}^{4} (i-1)(j-1)\, c_{ij}\, t^{i-2} u^{j-2}\, (dt/dx_1)(du/dx_2)
\end{aligned}
\tag{3.6.6}
$$

where t and u are again given by equation (3.6.4).
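Once the sixteen coefficients are in hand, the first three formulas can be evaluated with nested Horner-style multiplication. The following sketch shows that evaluation in isolation; it assumes a zero-offset table in which c[i][j] multiplies $t^i u^j$ (that is, shifted by one from the unit-offset $c_{ij}$ of the text), and the name bicubic_eval is hypothetical. The factors $dt/dx_1 = 1/d_1$ and $du/dx_2 = 1/d_2$ follow from equation (3.6.4), with d1 and d2 the side lengths of the grid square.

```c
/* Evaluate the bicubic polynomial and its two gradient components at
   scaled coordinates (t,u). c[i][j] is the coefficient of t^i * u^j. */
void bicubic_eval(double c[4][4], double t, double u, double d1, double d2,
                  double *y, double *y1, double *y2)
{
    double sy = 0.0, sy1 = 0.0, sy2 = 0.0;
    for (int i = 3; i >= 0; i--) {
        /* Horner in t, with the inner cubic in u for row i. */
        sy  = t * sy  + ((c[i][3] * u + c[i][2]) * u + c[i][1]) * u + c[i][0];
        /* Same outer Horner in t, inner polynomial differentiated in u. */
        sy2 = t * sy2 + (3.0 * c[i][3] * u + 2.0 * c[i][2]) * u + c[i][1];
        /* Horner in u, inner polynomial differentiated in t (column i). */
        sy1 = u * sy1 + (3.0 * c[3][i] * t + 2.0 * c[2][i]) * t + c[1][i];
    }
    *y = sy;
    *y1 = sy1 / d1;      /* chain rule: dt/dx1 = 1/d1 */
    *y2 = sy2 / d2;      /* chain rule: du/dx2 = 1/d2 */
}
```

As a hand check, the polynomial $2 + 3tu^2$ (only c[0][0] = 2 and c[1][2] = 3 nonzero) at t = 0.5, u = 2 has value 8, derivative 12 with respect to t, and derivative 6 with respect to u, before division by d1 and d2.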

void bcucof(float y[], float y1[], float y2[], float y12[], float d1, float d2,
    float **c)

Given arrays y[1..4], y1[1..4], y2[1..4], and y12[1..4], containing the function, gradients,
and cross derivative at the four grid points of a rectangular grid cell (numbered
counterclockwise from the lower left), and given d1 and d2, the length of the grid cell in the 1- and
2-directions, this routine returns the table c[1..4][1..4] that is used by routine bcuint
for bicubic interpolation.

{
    static int wt[16][16]=
        { 1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,
        0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,
        -3,0,0,3,0,0,0,0,-2,0,0,-1,0,0,0,0,
        2,0,0,-2,0,0,0,0,1,0,0,1,0,0,0,0,
        0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,
        0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,
        0,0,0,0,-3,0,0,3,0,0,0,0,-2,0,0,-1,
        0,0,0,0,2,0,0,-2,0,0,0,0,1,0,0,1,
        -3,3,0,0,-2,-1,0,0,0,0,0,0,0,0,0,0,
        0,0,0,0,0,0,0,0,-3,3,0,0,-2,-1,0,0,
        9,-9,9,-9,6,3,-3,-6,6,-6,-3,3,4,2,1,2,
        -6,6,-6,6,-4,-2,2,4,-3,3,3,-3,-2,-1,-1,-2,
        2,-2,0,0,1,1,0,0,0,0,0,0,0,0,0,0,
        0,0,0,0,0,0,0,0,2,-2,0,0,1,1,0,0,
        -6,6,-6,6,-3,-3,3,3,-4,4,2,-2,-2,-2,-1,-1,
        4,-4,4,-4,2,2,-2,-2,2,-2,-2,2,1,1,1,1};
    int l,k,j,i;
    float xx,d1d2,cl[16],x[16];

    d1d2=d1*d2;
    for (i=1;i<=4;i++) {                /* Pack a temporary vector x. */
        x[i-1]=y[i];
        x[i+3]=y1[i]*d1;
        x[i+7]=y2[i]*d2;
        x[i+11]=y12[i]*d1d2;
    }
    for (i=0;i<=15;i++) {               /* Matrix multiply by the stored table. */
        xx=0.0;
        for (k=0;k<=15;k++) xx += wt[i][k]*x[k];
        cl[i]=xx;
    }
    l=0;
    for (i=1;i<=4;i++)                  /* Unpack the result into the output table. */
        for (j=1;j<=4;j++) c[i][j]=cl[l++];
}

The implementation of equation (3.6.6), which performs a bicubic interpolation,
gives back the interpolated function value and the two gradient values, and uses the
above routine bcucof, is simply:


#include "nrutil.h"

void bcuint(float y[], float y1[], float y2[], float y12[], float x1l,
    float x1u, float x2l, float x2u, float x1, float x2, float *ansy,
    float *ansy1, float *ansy2)

Bicubic interpolation within a grid square. Input quantities are y,y1,y2,y12 (as described in
bcucof); x1l and x1u, the lower and upper coordinates of the grid square in the 1-direction;
x2l and x2u likewise for the 2-direction; and x1,x2, the coordinates of the desired point for
the interpolation. The interpolated function value is returned as ansy, and the interpolated
gradient values as ansy1 and ansy2. This routine calls bcucof.

{
    void bcucof(float y[], float y1[], float y2[], float y12[], float d1,
        float d2, float **c);
    int i;
    float t,u,d1,d2,**c;

    c=matrix(1,4,1,4);
    d1=x1u-x1l;
    d2=x2u-x2l;
    bcucof(y,y1,y2,y12,d1,d2,c);        /* Get the c's. */
    if (x1u == x1l || x2u == x2l) nrerror("Bad input in routine bcuint");
    t=(x1-x1l)/d1;                      /* Equation (3.6.4). */
    u=(x2-x2l)/d2;
    *ansy=(*ansy2)=(*ansy1)=0.0;
    for (i=4;i>=1;i--) {                /* Equation (3.6.6). */
        *ansy=t*(*ansy)+((c[i][4]*u+c[i][3])*u+c[i][2])*u+c[i][1];
        *ansy2=t*(*ansy2)+(3.0*c[i][4]*u+2.0*c[i][3])*u+c[i][2];
        *ansy1=u*(*ansy1)+(3.0*c[4][i]*t+2.0*c[3][i])*t+c[2][i];
    }
    *ansy1 /= d1;
    *ansy2 /= d2;
    free_matrix(c,1,4,1,4);
}


Higher Order for Smoothness: Bicubic Spline 

The other common technique for obtaining smoothness in two-dimensional 
interpolation is the bicubic spline. Actually, this is equivalent to a special case 
of bicubic interpolation: The interpolating function is of the same functional form 
as equation (3.6.6); the values of the derivatives at the grid points are, however, 
determined “globally” by one-dimensional splines. However, bicubic splines are 
usually implemented in a form that looks rather different from the above bicubic 
interpolation routines, instead looking much closer in form to the routine polin2 
above: To interpolate one functional value, one performs m one-dimensional splines 
across the rows of the table, followed by one additional one-dimensional spline 
down the newly created column. It is a matter of taste (and trade-off between time 
and memory) as to how much of this process one wants to precompute and store. 
Instead of precomputing and storing all the derivative information (as in bicubic 
interpolation), spline users typically precompute and store only one auxiliary table, 
of second derivatives in one direction only. Then one need only do spline evaluations 
(not constructions) for the m row splines; one must still do a construction and an
evaluation for the final column spline. (Recall that a spline construction is a process
of order N, while a spline evaluation is only of order log N, and that is just to
find the place in the table!)

Here is a routine to precompute the auxiliary second-derivative table: 

void splie2(float x1a[], float x2a[], float **ya, int m, int n, float **y2a)

Given an m by n tabulated function ya[1..m][1..n], and tabulated independent variables
x2a[1..n], this routine constructs one-dimensional natural cubic splines of the rows of ya
and returns the second-derivatives in the array y2a[1..m][1..n]. (The array x1a[1..m] is
included in the argument list merely for consistency with routine splin2.)

{
    void spline(float x[], float y[], int n, float yp1, float ypn, float y2[]);
    int j;

    for (j=1;j<=m;j++)
        spline(x2a,ya[j],n,1.0e30,1.0e30,y2a[j]);   /* Values 1e30 signal a natural spline. */
}


(If you want to interpolate on a sub-block of a bigger matrix, see §1.2.) 

After the above routine has been executed once, any number of bicubic spline 
interpolations can be performed by successive calls of the following routine: 

#include "nrutil.h"

void splin2(float x1a[], float x2a[], float **ya, float **y2a, int m, int n,
    float x1, float x2, float *y)

Given x1a, x2a, ya, m, n as described in splie2 and y2a as produced by that routine; and
given a desired interpolating point x1,x2; this routine returns an interpolated function value y
by bicubic spline interpolation.

{
    void spline(float x[], float y[], int n, float yp1, float ypn, float y2[]);
    void splint(float xa[], float ya[], float y2a[], int n, float x, float *y);
    int j;
    float *ytmp,*yytmp;

    ytmp=vector(1,m);
    yytmp=vector(1,m);
    /* Perform m evaluations of the row splines constructed by splie2,
       using the one-dimensional spline evaluator splint. */
    for (j=1;j<=m;j++)
        splint(x2a,ya[j],y2a[j],n,x2,&yytmp[j]);
    /* Construct the one-dimensional column spline and evaluate it. */
    spline(x1a,yytmp,m,1.0e30,1.0e30,ytmp);
    splint(x1a,yytmp,ytmp,m,x1,y);
    free_vector(yytmp,1,m);
    free_vector(ytmp,1,m);
}


CITED REFERENCES AND FURTHER READING: 

Abramowitz, M., and Stegun, I.A. 1964, Handbook of Mathematical Functions, Applied
Mathematics Series, Volume 55 (Washington: National Bureau of Standards; reprinted 1968 by
Dover Publications, New York), §25.2.

Kinahan, B.F., and Harm, R. 1975, Astrophysical Journal, vol. 200, pp. 330-335.

Johnson, L.W., and Riess, R.D. 1982, Numerical Analysis, 2nd ed. (Reading, MA:
Addison-Wesley), §5.2.7.

Dahlquist, G., and Bjorck, A. 1974, Numerical Methods (Englewood Cliffs, NJ: Prentice-Hall),
§7.7.


Chapter 4. Integration of Functions

4.0 Introduction 


Numerical integration, which is also called quadrature, has a history extending
back to the invention of calculus and before. The fact that integrals of elementary 
functions could not, in general, be computed analytically, while derivatives could 
be, served to give the field a certain panache, and to set it a cut above the arithmetic 
drudgery of numerical analysis during the whole of the 18th and 19th centuries. 

With the invention of automatic computing, quadrature became just one numerical
task among many, and not a very interesting one at that. Automatic computing,
even the most primitive sort involving desk calculators and rooms full of “computers” 
(that were, until the 1950s, people rather than machines), opened to feasibility the 
much richer field of numerical integration of differential equations. Quadrature is 
merely the simplest special case: The evaluation of the integral 



$$I = \int_a^b f(x)\,dx \tag{4.0.1}$$

is precisely equivalent to solving for the value $I \equiv y(b)$ the differential equation

$$\frac{dy}{dx} = f(x) \tag{4.0.2}$$

with the boundary condition

$$y(a) = 0 \tag{4.0.3}$$

Chapter 16 of this book deals with the numerical integration of differential 
equations. In that chapter, much emphasis is given to the concept of “variable” or 
“adaptive” choices of stepsize. We will not, therefore, develop that material here. 
If the function that you propose to integrate is sharply concentrated in one or more 
peaks, or if its shape is not readily characterized by a single length-scale, then it 
is likely that you should cast the problem in the form of (4.0.2)-(4.0.3) and use 
the methods of Chapter 16. 
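
The equivalence between (4.0.1) and the initial value problem (4.0.2)-(4.0.3) can be made concrete with a fixed-step integrator. The sketch below is not one of the adaptive routines of Chapter 16; it assumes the classical fourth-order Runge-Kutta step, and the names quad_as_ode and cube are hypothetical. Because the right-hand side does not depend on y, the two middle stages coincide and each step collapses to Simpson's rule on one interval.

```c
/* Integrate dy/dx = f(x), y(a) = 0, with nsteps fixed classical
   Runge-Kutta steps and return y(b), i.e. the integral of f from a to b.
   Illustrative sketch only: no adaptive stepsize control. */
double quad_as_ode(double (*f)(double), double a, double b, int nsteps)
{
    double h = (b - a) / nsteps;
    double y = 0.0;                  /* boundary condition (4.0.3) */
    int i;

    for (i = 0; i < nsteps; i++) {
        double x = a + i * h;
        double k1 = f(x);
        double k2 = f(x + 0.5 * h);  /* k2 equals k3: f has no y dependence */
        double k4 = f(x + h);
        y += h * (k1 + 4.0 * k2 + k4) / 6.0;
    }
    return y;
}

/* Example integrand (hypothetical): a cubic, integrated exactly up to roundoff. */
double cube(double x) { return x * x * x; }
```

For instance, quad_as_ode(cube, 0.0, 2.0, 4) reproduces the exact integral of x^3 over (0,2) to within roundoff.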

The quadrature methods in this chapter are based, in one way or another, on the 
obvious device of adding up the value of the integrand at a sequence of abscissas 
within the range of integration. The game is to obtain the integral as accurately 
as possible with the smallest number of function evaluations of the integrand. Just 
as in the case of interpolation (Chapter 3), one has the freedom to choose methods 
of various orders, with higher order sometimes, but not always, giving higher 
accuracy. “Romberg integration,” which is discussed in §4.3, is a general formalism 
for making use of integration methods of a variety of different orders, and we 
recommend it highly. 

Apart from the methods of this chapter and of Chapter 16, there are yet 
other methods for obtaining integrals. One important class is based on function 
approximation. We discuss explicitly the integration of functions by Chebyshev 
approximation (“Clenshaw-Curtis” quadrature) in §5.9. Although not explicitly 
discussed here, you ought to be able to figure out how to do cubic spline quadrature 
using the output of the routine spline in §3.3. (Hint: Integrate equation 3.3.3 
over x analytically. See [1].) 

Some integrals related to Fourier transforms can be calculated using the fast 
Fourier transform (FFT) algorithm. This is discussed in §13.9. 

Multidimensional integrals are another whole multidimensional bag of worms. 
Section 4.6 is an introductory discussion in this chapter; the important technique of 
Monte-Carlo integration is treated in Chapter 7. 


CITED REFERENCES AND FURTHER READING: 

Carnahan, B., Luther, H.A., and Wilkes, J.O. 1969, Applied Numerical Methods (New York: 
Wiley), Chapter 2. 

Isaacson, E., and Keller, H.B. 1966, Analysis of Numerical Methods (New York: Wiley), Chapter 7. 

Acton, F.S. 1970, Numerical Methods That Work; 1990, corrected edition (Washington: Mathematical 
Association of America), Chapter 4. 

Stoer, J., and Bulirsch, R. 1980, Introduction to Numerical Analysis (New York: Springer-Verlag), 
Chapter 3. 

Ralston, A., and Rabinowitz, P. 1978, A First Course in Numerical Analysis, 2nd ed. (New York: 
McGraw-Hill), Chapter 4. 

Dahlquist, G., and Bjorck, A. 1974, Numerical Methods (Englewood Cliffs, NJ: Prentice-Hall), 
§7.4. 

Kahaner, D., Moler, C., and Nash, S. 1989, Numerical Methods and Software (Englewood Cliffs, 
NJ: Prentice Hall), Chapter 5. 

Forsythe, G.E., Malcolm, M.A., and Moler, C.B. 1977, Computer Methods for Mathematical 
Computations (Englewood Cliffs, NJ: Prentice-Hall), §5.2, p. 89. [1] 

Davis, P., and Rabinowitz, P. 1984, Methods of Numerical Integration, 2nd ed. (Orlando, FL: 
Academic Press). 


4.1 Classical Formulas for Equally Spaced 
Abscissas 

Where would any book on numerical analysis be without Mr. Simpson and his 
“rule”? The classical formulas for integrating a function whose value is known at 
equally spaced steps have a certain elegance about them, and they are redolent with 
historical association. Through them, the modern numerical analyst communes with 
the spirits of his or her predecessors back across the centuries, as far as the time 
of Newton, if not farther. Alas, times do change; with the exception of two of the 
most modest formulas (“extended trapezoidal rule,” equation 4.1.11, and “extended 
midpoint rule,” equation 4.1.19, see §4.2), the classical formulas are almost entirely 
useless. They are museum pieces, but beautiful ones. 

Figure 4.1.1. Quadrature formulas with equally spaced abscissas compute the integral of a function 
between $x_0$ and $x_{N+1}$. Closed formulas evaluate the function on the boundary points, while open 
formulas refrain from doing so (useful if the evaluation algorithm breaks down on the boundary points). 

Some notation: We have a sequence of abscissas, denoted $x_0, x_1, \ldots, x_N$, 
$x_{N+1}$, which are spaced apart by a constant step $h$, 

$$x_i = x_0 + ih \qquad i = 0, 1, \ldots, N+1 \tag{4.1.1}$$

A function $f(x)$ has known values at the $x_i$'s, 

$$f(x_i) \equiv f_i \tag{4.1.2}$$

We want to integrate the function $f(x)$ between a lower limit $a$ and an upper limit 
$b$, where $a$ and $b$ are each equal to one or the other of the $x_i$'s. An integration 
formula that uses the value of the function at the endpoints, $f(a)$ or $f(b)$, is called 
a closed formula. Occasionally, we want to integrate a function whose value at one 
or both endpoints is difficult to compute (e.g., the computation of $f$ goes to a limit 
of zero over zero there, or worse yet has an integrable singularity there). In this 
case we want an open formula, which estimates the integral using only $x_i$'s strictly 
between $a$ and $b$ (see Figure 4.1.1). 

The basic building blocks of the classical formulas are rules for integrating a 
function over a small number of intervals. As that number increases, we can find 
rules that are exact for polynomials of increasingly high order. (Keep in mind that 
higher order does not always imply higher accuracy in real cases.) A sequence of 
such closed formulas is now given. 

Closed Newton-Cotes Formulas 

Trapezoidal rule: 

$$\int_{x_1}^{x_2} f(x)\,dx = h\left[\frac{1}{2} f_1 + \frac{1}{2} f_2\right] + O(h^3 f'') \tag{4.1.3}$$

Here the error term $O(\;)$ signifies that the true answer differs from the estimate by 
an amount that is the product of some numerical coefficient times $h^3$ times the value 
of the function’s second derivative somewhere in the interval of integration. The 
coefficient is knowable, and it can be found in all the standard references on this 
subject. The point at which the second derivative is to be evaluated is, however, 
unknowable. If we knew it, we could evaluate the function there and have a higher-order 
method! Since the product of a knowable and an unknowable is unknowable, 
we will streamline our formulas and write only $O(\;)$, instead of the coefficient. 

Equation (4.1.3) is a two-point formula ($x_1$ and $x_2$). It is exact for polynomials 
up to and including degree 1, i.e., $f(x) = x$. One anticipates that there is a 
three-point formula exact up to polynomials of degree 2. This is true; moreover, by a 
cancellation of coefficients due to left-right symmetry of the formula, the three-point 
formula is exact for polynomials up to and including degree 3, i.e., $f(x) = x^3$: 

Simpson’s rule: 

$$\int_{x_1}^{x_3} f(x)\,dx = h\left[\frac{1}{3} f_1 + \frac{4}{3} f_2 + \frac{1}{3} f_3\right] + O(h^5 f^{(4)}) \tag{4.1.4}$$

Here $f^{(4)}$ means the fourth derivative of the function $f$ evaluated at an unknown 
place in the interval. Note also that the formula gives the integral over an interval 
of size $2h$, so the coefficients add up to 2. 

There is no lucky cancellation in the four-point formula, so it is also exact for 
polynomials up to and including degree 3. 

Simpson’s 3/8 rule: 

$$\int_{x_1}^{x_4} f(x)\,dx = h\left[\frac{3}{8} f_1 + \frac{9}{8} f_2 + \frac{9}{8} f_3 + \frac{3}{8} f_4\right] + O(h^5 f^{(4)}) \tag{4.1.5}$$

The five-point formula again benefits from a cancellation: 


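Bode’s rule: 

$$\int_{x_1}^{x_5} f(x)\,dx = h\left[\frac{14}{45} f_1 + \frac{64}{45} f_2 + \frac{24}{45} f_3 + \frac{64}{45} f_4 + \frac{14}{45} f_5\right] + O(h^7 f^{(6)}) \tag{4.1.6}$$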


This is exact for polynomials up to and including degree 5. 
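
The degrees of exactness quoted for (4.1.3)-(4.1.6) are easy to check by machine. The following is a minimal sketch, not a routine from this book: the function names and the helper sample_power are hypothetical, and the array fv is assumed to hold $f_1, f_2, \ldots$ in zero-based storage.

```c
/* Closed Newton-Cotes rules on a single panel, equations (4.1.3)-(4.1.6).
   fv[0], fv[1], ... hold f_1, f_2, ... at spacing h. */
double trapezoid_panel(const double fv[], double h)
{
    return h * (0.5 * fv[0] + 0.5 * fv[1]);
}

double simpson_panel(const double fv[], double h)
{
    return h * (fv[0] + 4.0 * fv[1] + fv[2]) / 3.0;
}

double simpson38_panel(const double fv[], double h)
{
    return h * (3.0 * fv[0] + 9.0 * fv[1] + 9.0 * fv[2] + 3.0 * fv[3]) / 8.0;
}

double bode_panel(const double fv[], double h)
{
    return h * (14.0 * fv[0] + 64.0 * fv[1] + 24.0 * fv[2]
                + 64.0 * fv[3] + 14.0 * fv[4]) / 45.0;
}

/* Helper for testing exactness: fill fv[i] = (x0 + i*h)^p for i = 0..npts-1. */
void sample_power(double fv[], int npts, double x0, double h, int p)
{
    int i, k;
    for (i = 0; i < npts; i++) {
        double x = x0 + i * h, v = 1.0;
        for (k = 0; k < p; k++) v *= x;
        fv[i] = v;
    }
}
```

Sampling $x^3$ on three points and calling simpson_panel, or $x^5$ on five points and calling bode_panel, returns the exact integral up to roundoff; raising the power by one breaks the agreement.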

At this point the formulas stop being named after famous personages, so we 
will not go any further. Consult [1] for additional formulas in the sequence. 


Extrapolative Formulas for a Single Interval 


We are going to depart from historical practice for a moment. Many texts 
would give, at this point, a sequence of “Newton-Cotes Formulas of Open Type.” 
Here is an example: 


$$\int_{x_0}^{x_5} f(x)\,dx = h\left[\frac{55}{24} f_1 + \frac{5}{24} f_2 + \frac{5}{24} f_3 + \frac{55}{24} f_4\right] + O(h^5 f^{(4)})$$

Notice that the integral from $a = x_0$ to $b = x_5$ is estimated, using only the interior 
points $x_1, x_2, x_3, x_4$. In our opinion, formulas of this type are not useful for the 
reasons that (i) they cannot usefully be strung together to get “extended” rules, as we 
are about to do with the closed formulas, and (ii) for all other possible uses they are 
dominated by the Gaussian integration formulas which we will introduce in §4.5. 

Instead of the Newton-Cotes open formulas, let us set out the formulas for 
estimating the integral in the single interval from $x_0$ to $x_1$, using values of the 
function $f$ at $x_1, x_2, \ldots$. These will be useful building blocks for the “extended” 
open formulas. 


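$$\int_{x_0}^{x_1} f(x)\,dx = h\left[f_1\right] + O(h^2 f') \tag{4.1.7}$$

$$\int_{x_0}^{x_1} f(x)\,dx = h\left[\frac{3}{2} f_1 - \frac{1}{2} f_2\right] + O(h^3 f'') \tag{4.1.8}$$

$$\int_{x_0}^{x_1} f(x)\,dx = h\left[\frac{23}{12} f_1 - \frac{16}{12} f_2 + \frac{5}{12} f_3\right] + O(h^4 f^{(3)}) \tag{4.1.9}$$

$$\int_{x_0}^{x_1} f(x)\,dx = h\left[\frac{55}{24} f_1 - \frac{59}{24} f_2 + \frac{37}{24} f_3 - \frac{9}{24} f_4\right] + O(h^5 f^{(4)}) \tag{4.1.10}$$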



Perhaps a word here would be in order about how formulas like the above can 
be derived. There are elegant ways, but the most straightforward is to write down the 
basic form of the formula, replacing the numerical coefficients with unknowns, say 
p, q, r, s. Without loss of generality take $x_0 = 0$ and $x_1 = 1$, so $h = 1$. Substitute in 
turn for $f(x)$ (and for $f_1, f_2, f_3, f_4$) the functions $f(x) = 1$, $f(x) = x$, $f(x) = x^2$, 
and $f(x) = x^3$. Doing the integral in each case reduces the left-hand side to a 
number, and the right-hand side to a linear equation for the unknowns p, q, r, s. 
Solving the four equations produced in this way gives the coefficients. 
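
The recipe just described can be carried out mechanically. The sketch below sets up the four linear equations for the weights of the four-point formula (4.1.10) and solves them by Gaussian elimination with partial pivoting; the routine name derive_open_weights and the macro NC are hypothetical, and the elimination is written inline rather than calling any library routine.

```c
#define NC 4   /* number of unknown coefficients p, q, r, s */

/* Find the weights w[j] of h*[w0 f_1 + w1 f_2 + w2 f_3 + w3 f_4] that integrate
   1, x, x^2, x^3 exactly over (x_0, x_1) = (0, 1), with h = 1 and f_j at x = j. */
void derive_open_weights(double w[NC])
{
    double A[NC][NC + 1];   /* augmented matrix */
    int i, j, k;

    /* Row k: sum_j w_j * (j+1)^k = integral of x^k over (0,1) = 1/(k+1). */
    for (k = 0; k < NC; k++) {
        for (j = 0; j < NC; j++) {
            double v = 1.0;
            for (i = 0; i < k; i++) v *= (double)(j + 1);
            A[k][j] = v;
        }
        A[k][NC] = 1.0 / (k + 1);
    }

    /* Forward elimination with partial pivoting. */
    for (i = 0; i < NC; i++) {
        int piv = i;
        for (k = i + 1; k < NC; k++) {
            double ak = A[k][i] < 0.0 ? -A[k][i] : A[k][i];
            double ap = A[piv][i] < 0.0 ? -A[piv][i] : A[piv][i];
            if (ak > ap) piv = k;
        }
        for (j = 0; j <= NC; j++) {
            double t = A[i][j];
            A[i][j] = A[piv][j];
            A[piv][j] = t;
        }
        for (k = i + 1; k < NC; k++) {
            double m = A[k][i] / A[i][i];
            for (j = i; j <= NC; j++) A[k][j] -= m * A[i][j];
        }
    }

    /* Back substitution. */
    for (i = NC - 1; i >= 0; i--) {
        double s = A[i][NC];
        for (j = i + 1; j < NC; j++) s -= A[i][j] * w[j];
        w[i] = s / A[i][i];
    }
}
```

The solution agrees with the coefficients 55/24, -59/24, 37/24, -9/24 of equation (4.1.10) to roundoff.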


Extended Formulas (Closed) 

If we use equation (4.1.3) $N - 1$ times, to do the integration in the intervals 
$(x_1, x_2), (x_2, x_3), \ldots, (x_{N-1}, x_N)$, and then add the results, we obtain an “extended” 
or “composite” formula for the integral from $x_1$ to $x_N$. 

Extended trapezoidal rule: 

$$\int_{x_1}^{x_N} f(x)\,dx = h\left[\frac{1}{2} f_1 + f_2 + f_3 + \cdots + f_{N-1} + \frac{1}{2} f_N\right] + O\!\left(\frac{(b-a)^3 f''}{N^2}\right) \tag{4.1.11}$$




Here we have written the error estimate in terms of the interval b — a and the number 
of points N instead of in terms of h. This is clearer, since one is usually holding 
a and b fixed and wanting to know (e.g.) how much the error will be decreased 
by taking twice as many steps (in this case, it is by a factor of 4). In subsequent 
equations we will show only the scaling of the error term with the number of steps. 

For reasons that will not become clear until §4.2, equation (4.1.11) is in fact 
the most important equation in this section, the basis for most practical quadrature 
schemes. 
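
The factor-of-4 claim is simple to observe numerically. Here is a minimal double-precision sketch of (4.1.11), written directly from the formula rather than in the incremental style developed in §4.2; the names ext_trapezoid and quartic are hypothetical.

```c
/* Extended trapezoidal rule (4.1.11) with N equally spaced points
   x_1 = a, ..., x_N = b, i.e. N-1 steps of size h = (b-a)/(N-1). */
double ext_trapezoid(double (*f)(double), double a, double b, int N)
{
    double h = (b - a) / (N - 1);
    double sum = 0.5 * (f(a) + f(b));   /* endpoint weights 1/2 */
    int i;

    for (i = 1; i < N - 1; i++) sum += f(a + i * h);   /* interior weights 1 */
    return h * sum;
}

/* Example integrand (hypothetical): x^4, whose integral over (0,1) is 1/5. */
double quartic(double x) { return x * x * x * x; }
```

Comparing the errors of ext_trapezoid(quartic, 0.0, 1.0, 17) and ext_trapezoid(quartic, 0.0, 1.0, 33), that is, 16 steps against 32, gives a ratio close to 4.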

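The extended formula of order $1/N^3$ is: 

$$\int_{x_1}^{x_N} f(x)\,dx = h\left[\frac{5}{12} f_1 + \frac{13}{12} f_2 + f_3 + f_4 + \cdots + f_{N-2} + \frac{13}{12} f_{N-1} + \frac{5}{12} f_N\right] + O\!\left(\frac{1}{N^3}\right) \tag{4.1.12}$$

(We will see in a moment where this comes from.) 

If we apply equation (4.1.4) to successive, nonoverlapping pairs of intervals, 
we get the extended Simpson’s rule: 

$$\int_{x_1}^{x_N} f(x)\,dx = h\left[\frac{1}{3} f_1 + \frac{4}{3} f_2 + \frac{2}{3} f_3 + \frac{4}{3} f_4 + \cdots + \frac{2}{3} f_{N-2} + \frac{4}{3} f_{N-1} + \frac{1}{3} f_N\right] + O\!\left(\frac{1}{N^4}\right) \tag{4.1.13}$$

Notice that the 2/3, 4/3 alternation continues throughout the interior of the evaluation. 
Many people believe that the wobbling alternation somehow contains deep 
information about the integral of their function that is not apparent to mortal eyes. 
In fact, the alternation is an artifact of using the building block (4.1.4). Another 
extended formula with the same order as Simpson’s rule is 

$$\int_{x_1}^{x_N} f(x)\,dx = h\left[\frac{3}{8} f_1 + \frac{7}{6} f_2 + \frac{23}{24} f_3 + f_4 + f_5 + \cdots + f_{N-4} + f_{N-3} + \frac{23}{24} f_{N-2} + \frac{7}{6} f_{N-1} + \frac{3}{8} f_N\right] + O\!\left(\frac{1}{N^4}\right) \tag{4.1.14}$$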

This equation is constructed by fitting cubic polynomials through successive groups 
of four points; we defer details to §18.3, where a similar technique is used in the 
solution of integral equations. We can, however, tell you where equation (4.1.12) 
came from. It is Simpson’s extended rule, averaged with a modified version of 
itself in which the first and last step are done with the trapezoidal rule (4.1.3). The 
trapezoidal step is two orders lower than Simpson’s rule; however, its contribution to 
the integral goes down as an additional power of N (since it is used only twice, not 
N times). This makes the resulting formula of degree one less than Simpson. 

Figure 4.2.1. Sequential calls to the routine trapzd incorporate the information from previous calls and 
evaluate the integrand only at those new points necessary to refine the grid. The bottom line shows the 
totality of function evaluations after the fourth call. The routine qsimp, by weighting the intermediate 
results, transforms the trapezoid rule into Simpson’s rule with essentially no additional overhead. 
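
The construction of equation (4.1.12) as an average of two rules can be verified by building the weight vectors explicitly. The sketch below does so for a small odd number of points; the names simpson_weights, averaged_weights, and NPTS are hypothetical, and weights are expressed in units of $h$.

```c
#define NPTS 9   /* number of abscissas; odd, and at least 5, for this construction */

/* Weights of the extended Simpson rule (4.1.13) on n points (n odd). */
void simpson_weights(double w[], int n)
{
    int i;
    for (i = 0; i < n; i++) w[i] = (i % 2 == 1) ? 4.0 / 3.0 : 2.0 / 3.0;
    w[0] = w[n - 1] = 1.0 / 3.0;
}

/* Average extended Simpson with a modified copy in which the first and last
   steps use the trapezoidal rule (4.1.3) and the interior uses Simpson.
   Requires n <= NPTS. */
void averaged_weights(double w[], int n)
{
    double simp[NPTS], mod[NPTS];
    int i;

    simpson_weights(simp, n);
    for (i = 0; i < n; i++) mod[i] = 0.0;
    simpson_weights(mod + 1, n - 2);           /* Simpson on x_2 .. x_{N-1} */
    mod[0] += 0.5;      mod[1] += 0.5;         /* trapezoid on (x_1, x_2) */
    mod[n - 2] += 0.5;  mod[n - 1] += 0.5;     /* trapezoid on (x_{N-1}, x_N) */
    for (i = 0; i < n; i++) w[i] = 0.5 * (simp[i] + mod[i]);
}
```

With NPTS = 9 the averaged weights come out as 5/12, 13/12, 1, 1, 1, 1, 1, 13/12, 5/12.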

There are also formulas of higher order for this situation, but we will refrain from 
giving them. 

The semi-open formulas are just the obvious combinations of equations (4.1.11)-(4.1.14) 
with (4.1.15)-(4.1.18), respectively. At the closed end of the integration, 
use the weights from the former equations; at the open end use the weights from 
the latter equations. One example should give the idea, the formula with error term 
decreasing as $1/N^3$ which is closed on the right and open on the left: 

$$\int_{x_1}^{x_N} f(x)\,dx = h\left[\frac{23}{12} f_2 + \frac{7}{12} f_3 + f_4 + f_5 + \cdots + f_{N-2} + \frac{13}{12} f_{N-1} + \frac{5}{12} f_N\right] + O\!\left(\frac{1}{N^3}\right) \tag{4.1.20}$$

CITED REFERENCES AND FURTHER READING: 

Abramowitz, M., and Stegun, I.A. 1964, Handbook of Mathematical Functions, Applied Mathematics 
Series, Volume 55 (Washington: National Bureau of Standards; reprinted 1968 by 
Dover Publications, New York), §25.4. [1] 

Isaacson, E., and Keller, H.B. 1966, Analysis of Numerical Methods (New York: Wiley), §7.1. 


4.2 Elementary Algorithms 

Our starting point is equation (4.1.11), the extended trapezoidal rule. There are 
two facts about the trapezoidal rule which make it the starting point for a variety of 
algorithms. One fact is rather obvious, while the second is rather “deep.” 

The obvious fact is that, for a fixed function f(x) to be integrated between fixed 
limits a and b, one can double the number of intervals in the extended trapezoidal 
rule without losing the benefit of previous work. The coarsest implementation of 
the trapezoidal rule is to average the function at its endpoints a and b. The first 
stage of refinement is to add to this average the value of the function at the halfway 
point. The second stage of refinement is to add the values at the 1/4 and 3/4 points. 
And so on (see Figure 4.2.1). 

Without further ado we can write a routine with this kind of logic to it: 

#define FUNC(x) ((*func)(x)) 

float trapzd(float (*func)(float), float a, float b, int n) 
/* This routine computes the nth stage of refinement of an extended trapezoidal rule. func is input 
as a pointer to the function to be integrated between limits a and b, also input. When called with 
n=1, the routine returns the crudest estimate of the integral of f(x) from a to b. Subsequent calls 
with n=2,3,... (in that sequential order) will improve the accuracy by adding 2^(n-2) additional 
interior points. */ 
{ 
    float x,tnm,sum,del; 
    static float s; 
    int it,j; 

    if (n == 1) { 
        return (s=0.5*(b-a)*(FUNC(a)+FUNC(b))); 
    } else { 
        for (it=1,j=1;j<n-1;j++) it <<= 1; 
        tnm=it; 
        del=(b-a)/tnm;    /* This is the spacing of the points to be added. */ 
        x=a+0.5*del; 
        for (sum=0.0,j=1;j<=it;j++,x+=del) sum += FUNC(x); 
        s=0.5*(s+(b-a)*sum/tnm);    /* This replaces s by its refined value. */ 
        return s; 
    } 
} 


The above routine (trapzd) is a workhorse that can be harnessed in several 
ways. The simplest and crudest is to integrate a function by the extended trapezoidal 
rule where you know in advance (we can’t imagine how!) the number of steps you 
want. If you want $2^M + 1$, you can accomplish this by the fragment 


for(j=1;j<=m+1;j++) s=trapzd(func,a,b,j); 

with the answer returned as s. 

Much better, of course, is to refine the trapezoidal rule until some specified 
degree of accuracy has been achieved: 

#include <math.h> 

#define EPS 1.0e-5 
#define JMAX 20 

float qtrap(float (*func)(float), float a, float b) 
/* Returns the integral of the function func from a to b. The parameter EPS can be set to the 
desired fractional accuracy and JMAX so that 2 to the power JMAX-1 is the maximum allowed 
number of steps. Integration is performed by the trapezoidal rule. */ 
{ 
    float trapzd(float (*func)(float), float a, float b, int n); 
    void nrerror(char error_text[]); 
    int j; 
    float s,olds=0.0;    /* Initial value of olds is arbitrary. */ 

    for (j=1;j<=JMAX;j++) { 
        s=trapzd(func,a,b,j); 
        if (j > 5)    /* Avoid spurious early convergence. */ 
            if (fabs(s-olds) < EPS*fabs(olds) || 
                (s == 0.0 && olds == 0.0)) return s; 
        olds=s; 
    } 
    nrerror("Too many steps in routine qtrap"); 
    return 0.0;    /* Never get here. */ 
} 

Unsophisticated as it is, routine qtrap is in fact a fairly robust way of doing 
integrals of functions that are not very smooth. Increased sophistication will usually 
translate into a higher-order method whose efficiency will be greater only for 
sufficiently smooth integrands. qtrap is the method of choice, e.g., for an integrand 
which is a function of a variable that is linearly interpolated between measured data 
points. Be sure that you do not require too stringent an EPS, however: If qtrap takes 
too many steps in trying to achieve your required accuracy, accumulated roundoff 
errors may start increasing, and the routine may never converge. A value $10^{-6}$ 
is just on the edge of trouble for most 32-bit machines; it is achievable when the 
convergence is moderately rapid, but not otherwise. 

We come now to the “deep” fact about the extended trapezoidal rule, equation 
(4.1.11). It is this: The error of the approximation, which begins with a term of 
order $1/N^2$, is in fact entirely even when expressed in powers of $1/N$. This follows 
directly from the Euler-Maclaurin Summation Formula, 

$$\int_{x_1}^{x_N} f(x)\,dx = h\left[\frac{1}{2} f_1 + f_2 + f_3 + \cdots + f_{N-1} + \frac{1}{2} f_N\right] - \frac{B_2 h^2}{2!}\left(f'_N - f'_1\right) - \cdots - \frac{B_{2k} h^{2k}}{(2k)!}\left(f^{(2k-1)}_N - f^{(2k-1)}_1\right) - \cdots \tag{4.2.1}$$

Here $B_{2k}$ is a Bernoulli number. 

Equation (4.2.1) is not a convergent expansion, but rather only an asymptotic 
expansion whose error when truncated at any point is always less than twice the 
magnitude of the first neglected term. The reason that it is not convergent is that 
the Bernoulli numbers become very large, e.g., 

$$B_{50} = \frac{495057205241079648212477525}{66}$$

The key point is that only even powers of $h$ occur in the error series of (4.2.1). 
This fact is not, in general, shared by the higher-order quadrature rules in §4.1. 
For example, equation (4.1.12) has an error series beginning with $O(1/N^3)$, but 
continuing with all subsequent powers of $N$: $1/N^4$, $1/N^5$, etc. 

Suppose we evaluate (4.1.11) with $N$ steps, getting a result $S_N$, and then again 
with $2N$ steps, getting a result $S_{2N}$. (This is done by any two consecutive calls of 
trapzd.) The leading error term in the second evaluation will be 1/4 the size of the 
error in the first evaluation. Therefore the combination 

$$S = \frac{4}{3} S_{2N} - \frac{1}{3} S_N \tag{4.2.4}$$

will cancel out the leading order error term. But there is no error term of order $1/N^3$, 
by (4.2.1). The surviving error is of order $1/N^4$, the same as Simpson’s rule. In fact, 
it should not take long for you to see that (4.2.4) is exactly Simpson’s rule (4.1.13), 
alternating 2/3’s, 4/3’s, and all. This is the preferred method for evaluating that rule, 
and we can write it as a routine exactly analogous to qtrap above: 

#include <math.h> 

#define EPS 1.0e-6 
#define JMAX 20 

float qsimp(float (*func)(float), float a, float b) 
/* Returns the integral of the function func from a to b. The parameter EPS can be set to the 
desired fractional accuracy and JMAX so that 2 to the power JMAX-1 is the maximum allowed 
number of steps. Integration is performed by Simpson's rule. */ 
{ 
    float trapzd(float (*func)(float), float a, float b, int n); 
    void nrerror(char error_text[]); 
    int j; 
    float s,st,ost=0.0,os=0.0; 

    for (j=1;j<=JMAX;j++) { 
        st=trapzd(func,a,b,j); 
        s=(4.0*st-ost)/3.0;    /* Compare equation (4.2.4), above. */ 
        if (j > 5)    /* Avoid spurious early convergence. */ 
            if (fabs(s-os) < EPS*fabs(os) || 
                (s == 0.0 && os == 0.0)) return s; 
        os=s; 
        ost=st; 
    } 
    nrerror("Too many steps in routine qsimp"); 
    return 0.0;    /* Never get here. */ 
} 

The routine qsimp will in general be more efficient than qtrap (i.e., require 
fewer function evaluations) when the function to be integrated has a finite 4th 
derivative (i.e., a continuous 3rd derivative). The combination of qsimp and its 
necessary workhorse trapzd is a good one for light-duty work. 
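
The statement that the combination (4.2.4) is exactly the extended Simpson's rule (4.1.13) can be checked directly. The sketch below uses double precision and self-contained helper routines rather than trapzd; the names trap_steps, simpson_steps, richardson_simpson, and rational_demo are hypothetical.

```c
/* Extended trapezoidal rule with nsteps equal steps on (a,b). */
double trap_steps(double (*f)(double), double a, double b, int nsteps)
{
    double h = (b - a) / nsteps;
    double sum = 0.5 * (f(a) + f(b));
    int i;
    for (i = 1; i < nsteps; i++) sum += f(a + i * h);
    return h * sum;
}

/* Extended Simpson's rule (4.1.13) with nsteps equal steps (nsteps even). */
double simpson_steps(double (*f)(double), double a, double b, int nsteps)
{
    double h = (b - a) / nsteps;
    double sum = f(a) + f(b);
    int i;
    for (i = 1; i < nsteps; i++) sum += (i % 2 ? 4.0 : 2.0) * f(a + i * h);
    return h * sum / 3.0;
}

/* Combination (4.2.4): S = (4/3) S_2N - (1/3) S_N, built from two trapezoidal sums. */
double richardson_simpson(double (*f)(double), double a, double b, int nsteps)
{
    return (4.0 * trap_steps(f, a, b, 2 * nsteps)
            - trap_steps(f, a, b, nsteps)) / 3.0;
}

/* Example integrand (hypothetical): 1/(1+x), whose integral over (0,1) is ln 2. */
double rational_demo(double x) { return 1.0 / (1.0 + x); }
```

For any even step count the two routes agree to roundoff: richardson_simpson(f, a, b, 8) matches simpson_steps(f, a, b, 16).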


CITED REFERENCES AND FURTHER READING: 

Stoer, J., and Bulirsch, R. 1980, Introduction to Numerical Analysis (New York: Springer-Verlag), 
§3.3. 

Dahlquist, G., and Bjorck, A. 1974, Numerical Methods (Englewood Cliffs, NJ: Prentice-Hall), 
§§7.4.1-7.4.2. 

Forsythe, G.E., Malcolm, M.A., and Moler, C.B. 1977, Computer Methods for Mathematical 
Computations (Englewood Cliffs, NJ: Prentice-Hall), §5.3. 


4.3 Romberg Integration 


We can view Romberg’s method as the natural generalization of the routine 
qsimp in the last section to integration schemes that are of higher order than 
Simpson’s rule. The basic idea is to use the results from k successive refinements 
of the extended trapezoidal rule (implemented in trapzd) to remove all terms in 
the error series up to but not including $O(1/N^{2k})$. The routine qsimp is the case 
of $k = 2$. This is one example of a very general idea that goes by the name of 
Richardson’s deferred approach to the limit: Perform some numerical algorithm for 
various values of a parameter h, and then extrapolate the result to the continuum 
limit h = 0. 

Equation (4.2.4), which subtracts off the leading error term, is a special case of 
polynomial extrapolation. In the more general Romberg case, we can use Neville’s 
algorithm (see §3.1) to extrapolate the successive refinements to zero stepsize. 
Neville’s algorithm can in fact be coded very concisely within a Romberg integration 
routine. For clarity of the program, however, it seems better to do the extrapolation 
by function call to polint, already given in §3.1. 


#include <math.h> 

#define EPS 1.0e-6 
#define JMAX 20 
#define JMAXP (JMAX+1) 
#define K 5 
/* Here EPS is the fractional accuracy desired, as determined by the extrapolation error estimate; 
JMAX limits the total number of steps; K is the number of points used in the extrapolation. */ 

float qromb(float (*func)(float), float a, float b) 
/* Returns the integral of the function func from a to b. Integration is performed by Romberg's 
method of order 2K, where, e.g., K=2 is Simpson's rule. */ 
{ 
    void polint(float xa[], float ya[], int n, float x, float *y, float *dy); 
    float trapzd(float (*func)(float), float a, float b, int n); 
    void nrerror(char error_text[]); 
    float ss,dss; 
    float s[JMAXP],h[JMAXP+1];    /* These store the successive trapezoidal approximations */ 
    int j;                        /* and their relative stepsizes. */ 

    h[1]=1.0; 
    for (j=1;j<=JMAX;j++) { 
        s[j]=trapzd(func,a,b,j); 
        if (j >= K) { 
            polint(&h[j-K],&s[j-K],K,0.0,&ss,&dss); 
            if (fabs(dss) <= EPS*fabs(ss)) return ss; 
        } 
        h[j+1]=0.25*h[j]; 
        /* This is a key step: The factor is 0.25 even though the stepsize is decreased by only 
        0.5. This makes the extrapolation a polynomial in h^2 as allowed by equation (4.2.1), 
        not just a polynomial in h. */ 
    } 
    nrerror("Too many steps in routine qromb"); 
    return 0.0;    /* Never get here. */ 
} 



The routine qromb, along with its required trapzd and polint, is quite 
powerful for sufficiently smooth (e.g., analytic) integrands, integrated over intervals 
which contain no singularities, and where the endpoints are also nonsingular. qromb, 
in such circumstances, takes many, many fewer function evaluations than either of 
the routines in §4.2. For example, the integral 

$$\int_0^2 x^4 \log\left(x + \sqrt{x^2 + 1}\right) dx$$


converges (with parameters as shown above) on the very first extrapolation, after 
just 5 calls to trapzd, while qsimp requires 8 calls (8 times as many evaluations of 
the integrand) and qtrap requires 13 calls (making 256 times as many evaluations 
of the integrand). 


CITED REFERENCES AND FURTHER READING: 

Stoer, J., and Bulirsch, R. 1980, Introduction to Numerical Analysis (New York: Springer-Verlag), 
§§3.4-3.5. 

Dahlquist, G., and Bjorck, A. 1974, Numerical Methods (Englewood Cliffs, NJ: Prentice-Hall), 
§§7.4.1-7.4.2. 

Ralston, A., and Rabinowitz, P. 1978, A First Course in Numerical Analysis, 2nd ed. (New York: 
McGraw-Hill), §4.10-2. 


4.4 Improper Integrals 

For our present purposes, an integral will be “improper” if it has any of the 
following problems: 

• its integrand goes to a finite limiting value at finite upper and lower limits, 
but cannot be evaluated right on one of those limits (e.g., $\sin x/x$ at $x = 0$) 

• its upper limit is $\infty$, or its lower limit is $-\infty$ 

• it has an integrable singularity at either limit (e.g., $x^{-1/2}$ at $x = 0$) 

• it has an integrable singularity at a known place between its upper and 
lower limits 

• it has an integrable singularity at an unknown place between its upper 
and lower limits 

If an integral is infinite (e.g., $\int_1^\infty x^{-1}\,dx$), or does not exist in a limiting sense 
(e.g., $\int_{-\infty}^{\infty} \cos x\,dx$), we do not call it improper; we call it impossible. No amount of 
clever algorithmics will return a meaningful answer to an ill-posed problem. 

In this section we will generalize the techniques of the preceding two sections 
to cover the first four problems on the above list. A more advanced discussion of 
quadrature with integrable singularities occurs in Chapter 18, notably §18.3. The 
fifth problem, singularity at unknown location, can really only be handled by the 
use of a variable stepsize differential equation integration routine, as will be given 
in Chapter 16. 

We need a workhorse like the extended trapezoidal rule (equation 4.1.11), but 
one which is an open formula in the sense of §4.1, i.e., does not require the integrand 
to be evaluated at the endpoints. Equation (4.1.19), the extended midpoint rule, is the 
best choice. The reason is that (4.1.19) shares with (4.1.11) the “deep” property of 
having an error series that is entirely even in $h$. Indeed there is a formula, not as well 
known as it ought to be, called the Second Euler-Maclaurin summation formula, 

$$\int_{x_1}^{x_N} f(x)\,dx = h\left[f_{3/2} + f_{5/2} + f_{7/2} + \cdots + f_{N-3/2} + f_{N-1/2}\right] + \frac{B_2 h^2}{4}\left(f'_N - f'_1\right) + \cdots + \frac{B_{2k} h^{2k}}{(2k)!}\left(1 - 2^{-2k+1}\right)\left(f^{(2k-1)}_N - f^{(2k-1)}_1\right) + \cdots \tag{4.4.1}$$

This equation can be derived by writing out (4.2.1) with stepsize $h$, then writing it 
out again with stepsize $h/2$, then subtracting the first from twice the second. 

It is not possible to double the number of steps in the extended midpoint rule 
and still have the benefit of previous function evaluations (try it!). However, it is 
possible to triple the number of steps and do so. Shall we do this, or double and 
accept the loss? On the average, tripling does a factor $\sqrt{3}$ of unnecessary work, 
since the “right” number of steps for a desired accuracy criterion may in fact fall 
anywhere in the logarithmic interval implied by tripling. For doubling, the factor 
is only $\sqrt{2}$, but we lose an extra factor of 2 in being unable to use all the previous 
evaluations. Since $1.732 < 2 \times 1.414$, it is better to triple. 

Here is the resulting routine, which is directly comparable to trapzd. 

#define FUNC(x) ((*func)(x)) 

float midpnt(float (*func)(float), float a, float b, int n) 
/* This routine computes the nth stage of refinement of an extended midpoint rule. func is input 
as a pointer to the function to be integrated between limits a and b, also input. When called with 
n=1, the routine returns the crudest estimate of the integral of f(x) from a to b. Subsequent calls 
with n=2,3,... (in that sequential order) will improve the accuracy of s by adding (2/3) x 3^(n-1) 
additional interior points. s should not be modified between sequential calls. */ 
{ 
    float x,tnm,sum,del,ddel; 
    static float s; 
    int it,j; 

    if (n == 1) { 
        return (s=(b-a)*FUNC(0.5*(a+b))); 
    } else { 
        for(it=1,j=1;j<n-1;j++) it *= 3; 
        tnm=it; 
        del=(b-a)/(3.0*tnm); 
        ddel=del+del;    /* The added points alternate in spacing between del and ddel. */ 
        x=a+0.5*del; 
        sum=0.0; 
        for (j=1;j<=it;j++) { 
            sum += FUNC(x); 
            x += ddel; 
            sum += FUNC(x); 
            x += del; 
        } 
        s=(s+(b-a)*sum/tnm)/3.0;    /* The new sum is combined with the old integral */ 
        return s;                   /* to give a refined integral. */ 
    } 
} 

The routine midpnt can exactly replace trapzd in a driver routine like qtrap 
(§4.2); one simply changes trapzd(func,a,b,j) to midpnt(func,a,b,j), and 
perhaps also decreases the parameter JMAX since $3^{\text{JMAX}-1}$ (from step tripling) is a 
much larger number than $2^{\text{JMAX}-1}$ (step doubling). 

The open formula implementation analogous to Simpson’s rule (qsimp in §4.2) 
substitutes midpnt for trapzd and decreases JMAX as above, but now also changes 
the extrapolation step to be 

s=(9.0*st-ost)/8.0; 

since, when the number of steps is tripled, the error decreases to 1/9th its size, not 
1/4th as with step doubling. 

Either the modified qtrap or the modified qsimp will fix the first problem 
on the list at the beginning of this section. Yet more sophisticated is to generalize 
Romberg integration in like manner: 

#include <math.h> 

#define EPS 1.0e-6 
#define JMAX 14 
#define JMAXP (JMAX+1) 
#define K 5 

float qromo(float (*func)(float), float a, float b, 
    float (*choose)(float(*)(float), float, float, int)) 
/* Romberg integration on an open interval. Returns the integral of the function func from a to b, 
using any specified integrating function choose and Romberg's method. Normally choose will 
be an open formula, not evaluating the function at the endpoints. It is assumed that choose 
triples the number of steps on each call, and that its error series contains only even powers of 
the number of steps. The routines midpnt, midinf, midsql, midsqu, midexp, are possible 
choices for choose. The parameters have the same meaning as in qromb. */ 
{ 
    void polint(float xa[], float ya[], int n, float x, float *y, float *dy); 
    void nrerror(char error_text[]); 
    int j; 
    float ss,dss,h[JMAXP+1],s[JMAXP]; 

    h[1]=1.0; 
    for (j=1;j<=JMAX;j++) { 
        s[j]=(*choose)(func,a,b,j); 
        if (j >= K) { 
            polint(&h[j-K],&s[j-K],K,0.0,&ss,&dss); 
            if (fabs(dss) <= EPS*fabs(ss)) return ss; 
        } 
        h[j+1]=h[j]/9.0;    /* This is where the assumption of step tripling and an even */ 
    }                       /* error series is used. */ 
    nrerror("Too many steps in routine qromo"); 
    return 0.0;    /* Never get here. */ 
} 

Don’t be put off by qromo’s complicated ANSI declaration. A typical invocation 
(integrating the Bessel function $Y_0(x)$ from 0 to 2) is simply 

#include "nr.h" 
float answer; 

answer=qromo(bessy0,0.0,2.0,midpnt); 

The differences between qromo and qromb (§4.3) are so slight that it is perhaps 
gratuitous to list qromo in full. It, however, is an excellent driver routine for solving 
all the other problems of improper integrals in our first list (except the intractable 
fifth), as we shall now see. 

The basic trick for improper integrals is to make a change of variables to 
eliminate the singularity, or to map an infinite range of integration to a finite one. 
For example, the identity 


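$$\int_a^b f(x)\,dx = \int_{1/b}^{1/a} \frac{1}{t^2} f\!\left(\frac{1}{t}\right) dt \qquad ab > 0 \tag{4.4.2}$$

can be used with either $b \to \infty$ and $a$ positive, or with $a \to -\infty$ and $b$ negative, and 
works for any function which decreases towards infinity faster than $1/x^2$. 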

You can make the change of variable implied by (4.4.2) either analytically and 
then use (e.g.) qromo and midpnt to do the numerical evaluation, or you can let 
the numerical algorithm make the change of variable for you. We prefer the latter 
method as being more transparent to the user. To implement equation (4.4.2) we 
simply write a modified version of midpnt, called midinf, which allows b to be 
infinite (or, more precisely, a very large number on your particular machine, such 
as $1 \times 10^{30}$), or a to be negative and infinite. 


#define FUNC(x) ((*funk)(1.0/(x))/((x)*(x)))    /* Effects the change of variable. */

float midinf(float (*funk)(float), float aa, float bb, int n)
/* This routine is an exact replacement for midpnt, i.e., returns the nth stage of refinement of
the integral of funk from aa to bb, except that the function is evaluated at evenly spaced
points in 1/x rather than in x. This allows the upper limit bb to be as large and positive as
the computer allows, or the lower limit aa to be as large and negative, but not both. aa and
bb must have the same sign. */
{
    float x,tnm,sum,del,ddel,b,a;
    static float s;
    int it,j;

    b=1.0/aa;                   /* These two statements change the limits of integration. */
    a=1.0/bb;
    if (n == 1) {               /* From this point on, the routine is identical to midpnt. */
        return (s=(b-a)*FUNC(0.5*(a+b)));
    } else {
        for (it=1,j=1;j<n-1;j++) it *= 3;
        tnm=it;
        del=(b-a)/(3.0*tnm);
        ddel=del+del;
        x=a+0.5*del;
        sum=0.0;
        for (j=1;j<=it;j++) {
            sum += FUNC(x);
            x += ddel;
            sum += FUNC(x);
            x += del;
        }
        return (s=(s+(b-a)*sum/tnm)/3.0);
    }
}
#undef FUNC

If you need to integrate from a negative lower limit to positive infinity, you do
this by breaking the integral into two pieces at some positive value, for example,

answer=qromo(funk,-5.0,2.0,midpnt)+qromo(funk,2.0,1.0e30,midinf); 


Where should you choose the breakpoint? At a sufficiently large positive value so 
that the function funk is at least beginning to approach its asymptotic decrease to 
zero value at infinity. The polynomial extrapolation implicit in the second call to 
qromo deals with a polynomial in 1/x, not in x.
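The splitting described above can be sketched in a self-contained form. The example below is only an illustration under stated assumptions: it uses a plain composite midpoint rule in double precision instead of qromo's extrapolation, and the names midpoint, midpoint_to_infinity, lorentzian, and split_integral are hypothetical, not routines from this book. It integrates 1/(1+x^2) from 0 to infinity (exact value pi/2) by handling [0, breakpoint] directly and the tail through the 1/x change of variable of equation (4.4.2).

```c
#include <math.h>

/* Composite midpoint rule for the integral of f over [a,b] using n panels. */
double midpoint(double (*f)(double), double a, double b, int n)
{
    double h = (b - a) / n, sum = 0.0;
    int i;
    for (i = 0; i < n; i++) sum += f(a + (i + 0.5) * h);
    return h * sum;
}

/* Integral of f over [a, infinity), a > 0, after the substitution t = 1/x:
   it becomes the integral of f(1/t)/t^2 over (0, 1/a]. The midpoint rule
   never evaluates the integrand at t = 0, so the open endpoint is harmless. */
double midpoint_to_infinity(double (*f)(double), double a, int n)
{
    double b = 1.0 / a, h = b / n, sum = 0.0;
    int i;
    for (i = 0; i < n; i++) {
        double t = (i + 0.5) * h;
        sum += f(1.0 / t) / (t * t);
    }
    return h * sum;
}

/* Test integrand: decreases like 1/x^2, so the mapped integrand stays finite. */
double lorentzian(double x) { return 1.0 / (1.0 + x * x); }

/* Integral of lorentzian over [0, infinity), split at the given breakpoint. */
double split_integral(double breakpoint, int n)
{
    return midpoint(lorentzian, 0.0, breakpoint, n)
         + midpoint_to_infinity(lorentzian, breakpoint, n);
}
```

Calling split_integral(2.0, 2000) should land close to pi/2; moving the breakpoint changes how the work is shared between the two pieces but not the limit.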

To deal with an integral that has an integrable power-law singularity at its lower 
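limit, one also makes a change of variable. If the integrand diverges as $(x-a)^{-\gamma}$,
$0 \le \gamma < 1$, near $x = a$, use the identity

$$\int_a^b f(x)\,dx = \frac{1}{1-\gamma} \int_0^{(b-a)^{1-\gamma}} t^{\frac{\gamma}{1-\gamma}}\, f\!\left(t^{\frac{1}{1-\gamma}} + a\right) dt \qquad (b > a) \qquad (4.4.3)$$

If the singularity is at the upper limit, use the identity

$$\int_a^b f(x)\,dx = \frac{1}{1-\gamma} \int_0^{(b-a)^{1-\gamma}} t^{\frac{\gamma}{1-\gamma}}\, f\!\left(b - t^{\frac{1}{1-\gamma}}\right) dt \qquad (b > a) \qquad (4.4.4)$$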


If there is a singularity at both limits, divide the integral at an interior breakpoint 
as in the example above. 

Equations (4.4.3) and (4.4.4) are particularly simple in the case of inverse 
square-root singularities, a case that occurs frequently in practice: 


$$\int_a^b f(x)\,dx = \int_0^{\sqrt{b-a}} 2t\, f(a + t^2)\,dt \qquad (b > a) \qquad (4.4.5)$$

for a singularity at a, and

$$\int_a^b f(x)\,dx = \int_0^{\sqrt{b-a}} 2t\, f(b - t^2)\,dt \qquad (b > a) \qquad (4.4.6)$$


for a singularity at b. Once again, we can implement these changes of variable 
transparently to the user by defining substitute routines for midpnt which make the 
change of variable automatically: 


#include <math.h>
#define FUNC(x) (2.0*(x)*(*funk)(aa+(x)*(x)))

float midsql(float (*funk)(float), float aa, float bb, int n)
/* This routine is an exact replacement for midpnt, except that it allows for an inverse square-root
singularity in the integrand at the lower limit aa. */
{
    float x,tnm,sum,del,ddel,a,b;
    static float s;
    int it,j;

    b=sqrt(bb-aa);
    a=0.0;
    if (n == 1) {

The rest of the routine is exactly like midpnt and is omitted.

Similarly,

#include <math.h>
#define FUNC(x) (2.0*(x)*(*funk)(bb-(x)*(x)))

float midsqu(float (*funk)(float), float aa, float bb, int n)
/* This routine is an exact replacement for midpnt, except that it allows for an inverse square-root
singularity in the integrand at the upper limit bb. */
{
    float x,tnm,sum,del,ddel,a,b;
    static float s;
    int it,j;

    b=sqrt(bb-aa);
    a=0.0;
    if (n == 1) {

The rest of the routine is exactly like midpnt and is omitted.

One last example should suffice to show how these formulas are derived in 
general. Suppose the upper limit of integration is infinite, and the integrand falls off 
exponentially. Then we want a change of variable that maps $e^{-x}\,dx$ into $(\pm)\,dt$ (with
the sign chosen to keep the upper limit of the new variable larger than the lower
limit). Doing the integration gives by inspection

$$t = e^{-x} \qquad \text{or} \qquad x = -\log t \qquad (4.4.7)$$

so that

$$\int_{x=a}^{x=\infty} f(x)\,dx = \int_{t=0}^{t=e^{-a}} f(-\log t)\,\frac{dt}{t} \qquad (4.4.8)$$

The user-transparent implementation would be 

#include <math.h>
#define FUNC(x) ((*funk)(-log(x))/(x))

float midexp(float (*funk)(float), float aa, float bb, int n)
/* This routine is an exact replacement for midpnt, except that bb is assumed to be infinite
(value passed not actually used). It is assumed that the function funk decreases exponentially
rapidly at infinity. */
{
    float x,tnm,sum,del,ddel,a,b;
    static float s;
    int it,j;

    b=exp(-aa);
    a=0.0;
    if (n == 1) {

The rest of the routine is exactly like midpnt and is omitted.



CITED REFERENCES AND FURTHER READING: 

Acton, F.S. 1970, Numerical Methods That Work; 1990, corrected edition (Washington:
Mathematical Association of America), Chapter 4.




Dahlquist, G., and Bjorck, A. 1974, Numerical Methods (Englewood Cliffs, NJ: Prentice-Hall), 
§7.4.3, p. 294. 

Stoer, J., and Bulirsch, R. 1980, Introduction to Numerical Analysis (New York: Springer-Verlag), 
§3.7, p. 152. 


4.5 Gaussian Quadratures and Orthogonal Polynomials


In the formulas of §4.1, the integral of a function was approximated by the sum 
of its functional values at a set of equally spaced points, multiplied by certain aptly 
chosen weighting coefficients. We saw that as we allowed ourselves more freedom 
in choosing the coefficients, we could achieve integration formulas of higher and 
higher order. The idea of Gaussian quadratures is to give ourselves the freedom to 
choose not only the weighting coefficients, but also the location of the abscissas at 
which the function is to be evaluated: They will no longer be equally spaced. Thus, 
we will have twice the number of degrees of freedom at our disposal; it will turn out 
that we can achieve Gaussian quadrature formulas whose order is, essentially, twice 
that of the Newton-Cotes formula with the same number of function evaluations. 

Does this sound too good to be true? Well, in a sense it is. The catch is a 
familiar one, which cannot be overemphasized: High order is not the same as high 
accuracy. High order translates to high accuracy only when the integrand is very 
smooth, in the sense of being “well-approximated by a polynomial.” 
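A small numerical illustration of the doubled order may help. The sketch below compares two rules that each use two function evaluations on [-1,1]: the two-point Gauss-Legendre rule (abscissas at plus and minus 1/sqrt(3), unit weights) and the two-point trapezoidal rule. The function names are hypothetical. The Gaussian rule integrates the cubic x^3 + x^2 exactly (the value is 2/3), while the equally spaced rule does not.

```c
#include <math.h>

/* Cubic test integrand; its exact integral over [-1,1] is 2/3. */
double cubic(double x) { return x * x * x + x * x; }

/* Two-point Gauss-Legendre rule on [-1,1]: abscissas +-1/sqrt(3), weights 1. */
double gauss2(double (*f)(double))
{
    double x = 1.0 / sqrt(3.0);
    return f(-x) + f(x);
}

/* Two-point trapezoidal rule on [-1,1]: equally spaced abscissas at the
   endpoints, each with weight (b-a)/2 = 1. */
double trapezoid2(double (*f)(double))
{
    return f(-1.0) + f(1.0);
}
```

Here gauss2(cubic) reproduces 2/3 to rounding error, whereas trapezoid2(cubic) returns 2. For an integrand that is not well approximated by a polynomial, no such advantage should be expected.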

There is, however, one additional feature of Gaussian quadrature formulas that 
adds to their usefulness: We can arrange the choice of weights and abscissas to make 
the integral exact for a class of integrands “polynomials times some known function 
W(x)” rather than for the usual class of integrands “polynomials.” The function 
W(x) can then be chosen to remove integrable singularities from the desired integral.
Given W(x), in other words, and given an integer N, we can find a set of weights
$w_j$ and abscissas $x_j$ such that the approximation

$$\int_a^b W(x) f(x)\,dx \approx \sum_{j=1}^{N} w_j f(x_j) \qquad (4.5.1)$$

is exact if f(x) is a polynomial. For example, to do the integral

$$\int_{-1}^{1} \frac{\exp(-\cos^2 x)}{\sqrt{1-x^2}}\,dx \qquad (4.5.2)$$


(not a very natural looking integral, it must be admitted), we might well be interested 
in a Gaussian quadrature formula based on the choice 

$$W(x) = \frac{1}{\sqrt{1-x^2}} \qquad (4.5.3)$$

in the interval $(-1, 1)$. (This particular choice is called Gauss-Chebyshev integration,
for reasons that will become clear shortly.)

Notice that the integration formula (4.5.1) can also be written with the weight
function W(x) not overtly visible: Define $g(x) \equiv W(x) f(x)$ and $v_j \equiv w_j / W(x_j)$.
Then (4.5.1) becomes

$$\int_a^b g(x)\,dx \approx \sum_{j=1}^{N} v_j\, g(x_j) \qquad (4.5.4)$$

Where did the function W(x) go? It is lurking there, ready to give high-order
accuracy to integrands of the form polynomials times W(x), and ready to deny high-order
accuracy to integrands that are otherwise perfectly smooth and well-behaved.
When you find tabulations of the weights and abscissas for a given W(x), you have
to determine carefully whether they are to be used with a formula in the form of
(4.5.1), or like (4.5.4).

Here is an example of a quadrature routine that contains the tabulated abscissas 
and weights for the case W(x) = 1 and N = 10. Since the weights and abscissas 
are, in this case, symmetric around the midpoint of the range of integration, there 
are actually only five distinct values of each: 


float qgaus(float (*func)(float), float a, float b)
/* Returns the integral of the function func between a and b, by ten-point Gauss-Legendre
integration: the function is evaluated exactly ten times at interior points in the range of
integration. */
{
    int j;
    float xr,xm,dx,s;
    static float x[]={0.0,0.1488743389,0.4333953941,    /* The abscissas and weights. */
        0.6794095682,0.8650633666,0.9739065285};        /* First value of each array */
    static float w[]={0.0,0.2955242247,0.2692667193,    /* not used. */
        0.2190863625,0.1494513491,0.0666713443};

    xm=0.5*(b+a);
    xr=0.5*(b-a);
    s=0;                    /* Will be twice the average value of the function, since the
                               ten weights (five numbers above each used twice) sum to 2. */
    for (j=1;j<=5;j++) {
        dx=xr*x[j];
        s += w[j]*((*func)(xm+dx)+(*func)(xm-dx));
    }
    return s *= xr;         /* Scale the answer to the range of integration. */
}


The above routine illustrates that one can use Gaussian quadratures without 
necessarily understanding the theory behind them: One just locates tabulated weights 
and abscissas in a book (e.g., [1] or [2]). However, the theory is very pretty, and it 
will come in handy if you ever need to construct your own tabulation of weights and 
abscissas for an unusual choice of W(x). We will therefore give, without any proofs,
some useful results that will enable you to do this. Several of the results assume that 
W(x) does not change sign inside (a, b), which is usually the case in practice. 

The theory behind Gaussian quadratures goes back to Gauss in 1814, who 
used continued fractions to develop the subject. In 1826 Jacobi rederived Gauss’s 
results by means of orthogonal polynomials. The systematic treatment of arbitrary 
weight functions W(x) using orthogonal polynomials is largely due to Christoffel in 
1877. To introduce these orthogonal polynomials, let us fix the interval of interest 
to be (a, b). We can define the “scalar product of two functions f and g over a
weight function W” as

$$\langle f | g \rangle \equiv \int_a^b W(x) f(x) g(x)\,dx \qquad (4.5.5)$$


The scalar product is a number, not a function of x. Two functions are said to be 
orthogonal if their scalar product is zero. A function is said to be normalized if its 
scalar product with itself is unity. A set of functions that are all mutually orthogonal 
and also all individually normalized is called an orthonormal set. 

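We can find a set of polynomials (i) that includes exactly one polynomial of
order j, called $p_j(x)$, for each j = 0, 1, 2, ..., and (ii) all of which are mutually
orthogonal over the specified weight function W(x). A constructive procedure for
finding such a set is the recurrence relation

$$\begin{aligned} p_{-1}(x) &\equiv 0 \\ p_0(x) &\equiv 1 \\ p_{j+1}(x) &= (x - a_j)\,p_j(x) - b_j\,p_{j-1}(x) \qquad j = 0, 1, 2, \ldots \end{aligned} \qquad (4.5.6)$$

where

$$\begin{aligned} a_j &= \frac{\langle x p_j | p_j \rangle}{\langle p_j | p_j \rangle} \qquad j = 0, 1, \ldots \\ b_j &= \frac{\langle p_j | p_j \rangle}{\langle p_{j-1} | p_{j-1} \rangle} \qquad j = 1, 2, \ldots \end{aligned} \qquad (4.5.7)$$

The coefficient $b_0$ is arbitrary; we can take it to be zero.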

The polynomials defined by (4.5.6) are monic, i.e., the coefficient of their
leading term [$x^j$ for $p_j(x)$] is unity. If we divide each $p_j(x)$ by the constant
$[\langle p_j | p_j \rangle]^{1/2}$ we can render the set of polynomials orthonormal. One also encounters
orthogonal polynomials with various other normalizations. You can convert from
a given normalization to monic polynomials if you know that the coefficient of
$x^j$ in $p_j$ is $\lambda_j$, say; then the monic polynomials are obtained by dividing each $p_j$
by $\lambda_j$. Note that the coefficients in the recurrence relation (4.5.6) depend on the
adopted normalization.
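The recurrence (4.5.6) translates almost directly into code. The following sketch is an illustration only: the function names are hypothetical, arrays are 0-based (unlike the book's routines), and the coefficients used for the demonstration are those of the monic Legendre polynomials on (-1,1), namely a_j = 0 and b_j = j^2/(4j^2 - 1), a standard result that is stated here rather than derived.

```c
/* Evaluate the monic orthogonal polynomial p_n(x) by the recurrence (4.5.6),
   given coefficients a[0..n-1] and b[0..n-1]. b[0] multiplies p_{-1} = 0,
   so its value is irrelevant. */
double monic_poly(int n, const double a[], const double b[], double x)
{
    double pm1 = 0.0, p = 1.0;      /* p_{-1}(x) and p_0(x) */
    int j;
    for (j = 0; j < n; j++) {
        double pnext = (x - a[j]) * p - b[j] * pm1;
        pm1 = p;
        p = pnext;
    }
    return p;
}

/* Fill a[0..n-1], b[0..n-1] with the monic Legendre coefficients
   a_j = 0, b_j = j^2 / (4 j^2 - 1), taking b_0 = 0. */
void monic_legendre_coeffs(int n, double a[], double b[])
{
    int j;
    for (j = 0; j < n; j++) {
        a[j] = 0.0;
        b[j] = (j == 0) ? 0.0 : (double)(j * j) / (4.0 * j * j - 1.0);
    }
}
```

With these coefficients the recurrence reproduces p_2(x) = x^2 - 1/3 and p_3(x) = x^3 - (3/5)x, the usual Legendre polynomials divided by their leading coefficients.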

The polynomial $p_j(x)$ can be shown to have exactly j distinct roots in the
interval (a, b). Moreover, it can be shown that the roots of $p_j(x)$ “interleave” the
j - 1 roots of $p_{j-1}(x)$, i.e., there is exactly one root of the former in between each
two adjacent roots of the latter. This fact comes in handy if you need to find all the
roots: You can start with the one root of $p_1(x)$ and then, in turn, bracket the roots
of each higher j, pinning them down at each stage more precisely by Newton’s rule
or some other root-finding scheme (see Chapter 9).

Why would you ever want to find all the roots of an orthogonal polynomial
$p_j(x)$? Because the abscissas of the N-point Gaussian quadrature formulas (4.5.1)
and (4.5.4) with weighting function W(x) in the interval (a, b) are precisely the roots
of the orthogonal polynomial $p_N(x)$ for the same interval and weighting function.
This is the fundamental theorem of Gaussian quadratures, and lets you find the
abscissas for any particular case.

Once you know the abscissas $x_1, \ldots, x_N$, you need to find the weights $w_j$,
$j = 1, \ldots, N$. One way to do this (not the most efficient) is to solve the set of
linear equations

$$\begin{bmatrix} p_0(x_1) & \cdots & p_0(x_N) \\ p_1(x_1) & \cdots & p_1(x_N) \\ \vdots & & \vdots \\ p_{N-1}(x_1) & \cdots & p_{N-1}(x_N) \end{bmatrix} \begin{bmatrix} w_1 \\ w_2 \\ \vdots \\ w_N \end{bmatrix} = \begin{bmatrix} \int_a^b W(x)\,p_0(x)\,dx \\ 0 \\ \vdots \\ 0 \end{bmatrix} \qquad (4.5.8)$$


Equation (4.5.8) simply solves for those weights such that the quadrature (4.5.1)
gives the correct answer for the integral of the first N orthogonal polynomials. Note
that the zeros on the right-hand side of (4.5.8) appear because $p_1(x), \ldots, p_{N-1}(x)$
are all orthogonal to $p_0(x)$, which is a constant. It can be shown that, with those
weights, the integral of the next N - 1 polynomials is also exact, so that the quadrature
is exact for all polynomials of degree 2N - 1 or less. Another way to evaluate the
weights (though one whose proof is beyond our scope) is by the formula

$$w_j = \frac{\langle p_{N-1} | p_{N-1} \rangle}{p_{N-1}(x_j)\, p'_N(x_j)} \qquad (4.5.9)$$

where $p'_N(x_j)$ is the derivative of the orthogonal polynomial at its zero $x_j$.
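As a sketch of how formula (4.5.9) might be evaluated in practice, the function below runs the monic recurrence (4.5.6) and its term-by-term derivative side by side, and accumulates the norm of p_{N-1} from the ratio in (4.5.7), starting from the integral of W(x), called mu0 here. The name gauss_weight, the 0-based arrays, and the use of double precision are assumptions of this illustration, not the book's routine.

```c
#include <math.h>

/* Weight belonging to the abscissa xj (a root of p_N) of an N-point Gaussian
   rule, from equation (4.5.9). a[0..N-1], b[0..N-1] are the monic recurrence
   coefficients; mu0 is the integral of W(x) over (a,b), i.e. <p_0|p_0>. */
double gauss_weight(int N, const double a[], const double b[],
                    double mu0, double xj)
{
    double pm1 = 0.0, p = 1.0;      /* p_{j-1}(xj), p_j(xj)   */
    double dm1 = 0.0, d = 0.0;      /* p'_{j-1}(xj), p'_j(xj) */
    double norm = mu0;              /* becomes <p_{N-1}|p_{N-1}> */
    int j;
    for (j = 0; j < N; j++) {
        /* Recurrence (4.5.6) and its derivative with respect to x. */
        double pnext = (xj - a[j]) * p - b[j] * pm1;
        double dnext = p + (xj - a[j]) * d - b[j] * dm1;
        if (j >= 1) norm *= b[j];   /* <p_j|p_j> = b_j <p_{j-1}|p_{j-1}> */
        pm1 = p;  p = pnext;
        dm1 = d;  d = dnext;
    }
    /* Now pm1 = p_{N-1}(xj) and d = p'_N(xj). */
    return norm / (pm1 * d);
}
```

For the three-point Gauss-Legendre rule (a_j = 0, b_1 = 1/3, b_2 = 4/15, mu0 = 2) this gives the familiar weights 8/9 at x = 0 and 5/9 at x = sqrt(3/5).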

The computation of Gaussian quadrature rules thus involves two distinct phases:
(i) the generation of the orthogonal polynomials $p_0, \ldots, p_N$, i.e., the computation of
the coefficients $a_j$, $b_j$ in (4.5.6); (ii) the determination of the zeros of $p_N(x)$, and
the computation of the associated weights. For the case of the “classical” orthogonal
polynomials, the coefficients $a_j$ and $b_j$ are explicitly known (equations 4.5.10 to
4.5.14 below) and phase (i) can be omitted. However, if you are confronted with a
“nonclassical” weight function W(x), and you don’t know the coefficients $a_j$ and
$b_j$, the construction of the associated set of orthogonal polynomials is not trivial.
We discuss it at the end of this section.

Computation of the Abscissas and Weights 


This task can range from easy to difficult, depending on how much you already 
know about your weight function and its associated polynomials. In the case of 
classical, well-studied, orthogonal polynomials, practically everything is known, 
including good approximations for their zeros. These can be used as starting guesses, 
enabling Newton’s method (to be discussed in §9.4) to converge very rapidly. 
Newton’s method requires the derivative $p'_N(x)$, which is evaluated by standard
relations in terms of $p_N$ and $p_{N-1}$. The weights are then conveniently evaluated by
equation (4.5.9). For the following named cases, this direct root-finding is faster, 
by a factor of 3 to 5, than any other method. 

Here are the weight functions, intervals, and recurrence relations that generate
the most commonly used orthogonal polynomials and their corresponding Gaussian
quadrature formulas.

Gauss-Legendre:

$$W(x) = 1 \qquad -1 < x < 1$$
$$(j+1)\,P_{j+1} = (2j+1)\,x\,P_j - j\,P_{j-1} \qquad (4.5.10)$$

Gauss-Chebyshev:

$$W(x) = (1 - x^2)^{-1/2} \qquad -1 < x < 1$$
$$T_{j+1} = 2x\,T_j - T_{j-1} \qquad (4.5.11)$$

Gauss-Laguerre:

$$W(x) = x^{\alpha} e^{-x} \qquad 0 < x < \infty$$
$$(j+1)\,L^{\alpha}_{j+1} = (-x + 2j + \alpha + 1)\,L^{\alpha}_j - (j + \alpha)\,L^{\alpha}_{j-1} \qquad (4.5.12)$$

Gauss-Hermite:

$$W(x) = e^{-x^2} \qquad -\infty < x < \infty$$
$$H_{j+1} = 2x\,H_j - 2j\,H_{j-1} \qquad (4.5.13)$$

Gauss-Jacobi:

$$W(x) = (1 - x)^{\alpha} (1 + x)^{\beta} \qquad -1 < x < 1$$
$$c_j\,P^{(\alpha,\beta)}_{j+1} = (d_j + e_j x)\,P^{(\alpha,\beta)}_j - f_j\,P^{(\alpha,\beta)}_{j-1} \qquad (4.5.14)$$

where the coefficients $c_j$, $d_j$, $e_j$, and $f_j$ are given by

$$\begin{aligned} c_j &= 2(j+1)(j+\alpha+\beta+1)(2j+\alpha+\beta) \\ d_j &= (2j+\alpha+\beta+1)(\alpha^2-\beta^2) \\ e_j &= (2j+\alpha+\beta)(2j+\alpha+\beta+1)(2j+\alpha+\beta+2) \\ f_j &= 2(j+\alpha)(j+\beta)(2j+\alpha+\beta+2) \end{aligned} \qquad (4.5.15)$$

We now give individual routines that calculate the abscissas and weights for
these cases. First comes the most common set of abscissas and weights, those of
Gauss-Legendre. The routine, due to G.B. Rybicki, uses equation (4.5.9) in the
special form for the Gauss-Legendre case,

$$w_j = \frac{2}{(1 - x_j^2)\,[P'_N(x_j)]^2} \qquad (4.5.16)$$

The routine also scales the range of integration from $(x_1, x_2)$ to $(-1, 1)$, and provides
abscissas $x_j$ and weights $w_j$ for the Gaussian formula

$$\int_{x_1}^{x_2} f(x)\,dx = \sum_{j=1}^{N} w_j f(x_j) \qquad (4.5.17)$$

#include <math.h>
#define EPS 3.0e-11             /* EPS is the relative precision. */

void gauleg(float x1, float x2, float x[], float w[], int n)
/* Given the lower and upper limits of integration x1 and x2, and given n, this routine returns
arrays x[1..n] and w[1..n] of length n, containing the abscissas and weights of the Gauss-
Legendre n-point quadrature formula. */
{
    int m,j,i;
    double z1,z,xm,xl,pp,p3,p2,p1;  /* High precision is a good idea for this routine. */

    m=(n+1)/2;                  /* The roots are symmetric in the interval, so */
    xm=0.5*(x2+x1);             /* we only have to find half of them. */
    xl=0.5*(x2-x1);
    for (i=1;i<=m;i++) {        /* Loop over the desired roots. */
        z=cos(3.141592654*(i-0.25)/(n+0.5));
        /* Starting with the above approximation to the ith root, we enter the main loop of
           refinement by Newton's method. */
        do {
            p1=1.0;
            p2=0.0;
            for (j=1;j<=n;j++) {    /* Loop up the recurrence relation to get the */
                p3=p2;              /* Legendre polynomial evaluated at z. */
                p2=p1;
                p1=((2.0*j-1.0)*z*p2-(j-1.0)*p3)/j;
            }
            /* p1 is now the desired Legendre polynomial. We next compute pp, its derivative,
               by a standard relation involving also p2, the polynomial of one lower order. */
            pp=n*(z*p1-p2)/(z*z-1.0);
            z1=z;
            z=z1-p1/pp;             /* Newton's method. */
        } while (fabs(z-z1) > EPS);
        x[i]=xm-xl*z;               /* Scale the root to the desired interval, */
        x[n+1-i]=xm+xl*z;           /* and put in its symmetric counterpart. */
        w[i]=2.0*xl/((1.0-z*z)*pp*pp);  /* Compute the weight */
        w[n+1-i]=w[i];              /* and its symmetric counterpart. */
    }
}


Next we give three routines that use initial approximations for the roots given
by Stroud and Secrest [2]. The first is for Gauss-Laguerre abscissas and weights, to
be used with the integration formula

$$\int_0^{\infty} x^{\alpha} e^{-x} f(x)\,dx = \sum_{j=1}^{N} w_j f(x_j) \qquad (4.5.18)$$

#include <math.h>
#define EPS 3.0e-14             /* Increase EPS if you don't have this precision. */
#define MAXIT 10

void gaulag(float x[], float w[], int n, float alf)
/* Given alf, the parameter alpha of the Laguerre polynomials, this routine returns arrays x[1..n]
and w[1..n] containing the abscissas and weights of the n-point Gauss-Laguerre quadrature
formula. The smallest abscissa is returned in x[1], the largest in x[n]. */
{
    float gammln(float xx);
    void nrerror(char error_text[]);
    int i,its,j;
    float ai;
    double p1,p2,p3,pp,z,z1;    /* High precision is a good idea for this routine. */

    for (i=1;i<=n;i++) {        /* Loop over the desired roots. */
        if (i == 1) {           /* Initial guess for the smallest root. */
            z=(1.0+alf)*(3.0+0.92*alf)/(1.0+2.4*n+1.8*alf);
        } else if (i == 2) {    /* Initial guess for the second root. */
            z += (15.0+6.25*alf)/(1.0+0.9*alf+2.5*n);
        } else {                /* Initial guess for the other roots. */
            ai=i-2;
            z += ((1.0+2.55*ai)/(1.9*ai)+1.26*ai*alf/
                (1.0+3.5*ai))*(z-x[i-2])/(1.0+0.3*alf);
        }
        for (its=1;its<=MAXIT;its++) {  /* Refinement by Newton's method. */
            p1=1.0;
            p2=0.0;
            for (j=1;j<=n;j++) {    /* Loop up the recurrence relation to get the */
                p3=p2;              /* Laguerre polynomial evaluated at z. */
                p2=p1;
                p1=((2*j-1+alf-z)*p2-(j-1+alf)*p3)/j;
            }
            /* p1 is now the desired Laguerre polynomial. We next compute pp, its derivative,
               by a standard relation involving also p2, the polynomial of one lower order. */
            pp=(n*p1-(n+alf)*p2)/z;
            z1=z;
            z=z1-p1/pp;             /* Newton's formula. */
            if (fabs(z-z1) <= EPS) break;
        }
        if (its > MAXIT) nrerror("too many iterations in gaulag");
        x[i]=z;                     /* Store the root and the weight. */
        w[i] = -exp(gammln(alf+n)-gammln((float)n))/(pp*n*p2);
    }
}


Next is a routine for Gauss-Hermite abscissas and weights. If we use the
“standard” normalization of these functions, as given in equation (4.5.13), we find
that the computations overflow for large N because of various factorials that occur.
We can avoid this by using instead the orthonormal set of polynomials $\tilde{H}_j$. They
are generated by the recurrence

$$\tilde{H}_{-1} = 0, \qquad \tilde{H}_0 = \frac{1}{\pi^{1/4}}, \qquad \tilde{H}_{j+1} = x \sqrt{\frac{2}{j+1}}\,\tilde{H}_j - \sqrt{\frac{j}{j+1}}\,\tilde{H}_{j-1} \qquad (4.5.19)$$

The formula for the weights becomes

$$w_j = \frac{2}{[\tilde{H}'_N(x_j)]^2} \qquad (4.5.20)$$

while the formula for the derivative with this normalization is

$$\tilde{H}'_j = \sqrt{2j}\,\tilde{H}_{j-1} \qquad (4.5.21)$$

The abscissas and weights returned by gauher are used with the integration formula

$$\int_{-\infty}^{\infty} e^{-x^2} f(x)\,dx = \sum_{j=1}^{N} w_j f(x_j) \qquad (4.5.22)$$

#include <math.h>
#define EPS 3.0e-14                 /* Relative precision. */
#define PIM4 0.7511265444649425     /* 1/pi^(1/4). */
#define MAXIT 10                    /* Maximum iterations. */

void gauher(float x[], float w[], int n)
/* Given n, this routine returns arrays x[1..n] and w[1..n] containing the abscissas and weights
of the n-point Gauss-Hermite quadrature formula. The largest abscissa is returned in x[1], the
most negative in x[n]. */
{
    void nrerror(char error_text[]);
    int i,its,j,m;
    double p1,p2,p3,pp,z,z1;    /* High precision is a good idea for this routine. */

    m=(n+1)/2;
    /* The roots are symmetric about the origin, so we have to find only half of them. */
    for (i=1;i<=m;i++) {        /* Loop over the desired roots. */
        if (i == 1) {           /* Initial guess for the largest root. */
            z=sqrt((double)(2*n+1))-1.85575*pow((double)(2*n+1),-0.16667);
        } else if (i == 2) {    /* Initial guess for the second largest root. */
            z -= 1.14*pow((double)n,0.426)/z;
        } else if (i == 3) {    /* Initial guess for the third largest root. */
            z=1.86*z-0.86*x[1];
        } else if (i == 4) {    /* Initial guess for the fourth largest root. */
            z=1.91*z-0.91*x[2];
        } else {                /* Initial guess for the other roots. */
            z=2.0*z-x[i-2];
        }
        for (its=1;its<=MAXIT;its++) {  /* Refinement by Newton's method. */
            p1=PIM4;
            p2=0.0;
            for (j=1;j<=n;j++) {    /* Loop up the recurrence relation to get */
                p3=p2;              /* the Hermite polynomial evaluated at z. */
                p2=p1;
                p1=z*sqrt(2.0/j)*p2-sqrt(((double)(j-1))/j)*p3;
            }
            /* p1 is now the desired Hermite polynomial. We next compute pp, its derivative, by
               the relation (4.5.21) using p2, the polynomial of one lower order. */
            pp=sqrt((double)2*n)*p2;
            z1=z;
            z=z1-p1/pp;             /* Newton's formula. */
            if (fabs(z-z1) <= EPS) break;
        }
        if (its > MAXIT) nrerror("too many iterations in gauher");
        x[i]=z;                     /* Store the root */
        x[n+1-i] = -z;              /* and its symmetric counterpart. */
        w[i]=2.0/(pp*pp);           /* Compute the weight */
        w[n+1-i]=w[i];              /* and its symmetric counterpart. */
    }
}

Finally, here is a routine for Gauss-Jacobi abscissas and weights, which
implement the integration formula

$$\int_{-1}^{1} (1-x)^{\alpha} (1+x)^{\beta} f(x)\,dx = \sum_{j=1}^{N} w_j f(x_j) \qquad (4.5.23)$$

#include <math.h>
#define EPS 3.0e-14             /* Increase EPS if you don't have this precision. */
#define MAXIT 10

void gaujac(float x[], float w[], int n, float alf, float bet)
/* Given alf and bet, the parameters alpha and beta of the Jacobi polynomials, this routine returns
arrays x[1..n] and w[1..n] containing the abscissas and weights of the n-point Gauss-Jacobi
quadrature formula. The largest abscissa is returned in x[1], the smallest in x[n]. */
{
    float gammln(float xx);
    void nrerror(char error_text[]);
    int i,its,j;
    float alfbet,an,bn,r1,r2,r3;
    double a,b,c,p1,p2,p3,pp,temp,z,z1; /* High precision is a good idea for this routine. */

    for (i=1;i<=n;i++) {        /* Loop over the desired roots. */
        if (i == 1) {           /* Initial guess for the largest root. */
            an=alf/n;
            bn=bet/n;
            r1=(1.0+alf)*(2.78/(4.0+n*n)+0.768*an/n);
            r2=1.0+1.48*an+0.96*bn+0.452*an*an+0.83*an*bn;
            z=1.0-r1/r2;
        } else if (i == 2) {    /* Initial guess for the second largest root. */
            r1=(4.1+alf)/((1.0+alf)*(1.0+0.156*alf));
            r2=1.0+0.06*(n-8.0)*(1.0+0.12*alf)/n;
            r3=1.0+0.012*bet*(1.0+0.25*fabs(alf))/n;
            z -= (1.0-z)*r1*r2*r3;
        } else if (i == 3) {    /* Initial guess for the third largest root. */
            r1=(1.67+0.28*alf)/(1.0+0.37*alf);
            r2=1.0+0.22*(n-8.0)/n;
            r3=1.0+8.0*bet/((6.28+bet)*n*n);
            z -= (x[1]-z)*r1*r2*r3;
        } else if (i == n-1) {  /* Initial guess for the second smallest root. */
            r1=(1.0+0.235*bet)/(0.766+0.119*bet);
            r2=1.0/(1.0+0.639*(n-4.0)/(1.0+0.71*(n-4.0)));
            r3=1.0/(1.0+20.0*alf/((7.5+alf)*n*n));
            z += (z-x[n-3])*r1*r2*r3;
        } else if (i == n) {    /* Initial guess for the smallest root. */
            r1=(1.0+0.37*bet)/(1.67+0.28*bet);
            r2=1.0/(1.0+0.22*(n-8.0)/n);
            r3=1.0/(1.0+8.0*alf/((6.28+alf)*n*n));
            z += (z-x[n-2])*r1*r2*r3;
        } else {                /* Initial guess for the other roots. */
            z=3.0*x[i-1]-3.0*x[i-2]+x[i-3];
        }
        alfbet=alf+bet;
        for (its=1;its<=MAXIT;its++) {  /* Refinement by Newton's method. */
            temp=2.0+alfbet;            /* Start the recurrence with P0 and P1 to avoid */
            p1=(alf-bet+temp*z)/2.0;    /* a division by zero when alpha + beta = 0 or -1. */
            p2=1.0;
            for (j=2;j<=n;j++) {    /* Loop up the recurrence relation to get the */
                p3=p2;              /* Jacobi polynomial evaluated at z. */
                p2=p1;
                temp=2*j+alfbet;
                a=2*j*(j+alfbet)*(temp-2.0);
                b=(temp-1.0)*(alf*alf-bet*bet+temp*(temp-2.0)*z);
                c=2.0*(j-1+alf)*(j-1+bet)*temp;
                p1=(b*p2-c*p3)/a;
            }
            /* p1 is now the desired Jacobi polynomial. We next compute pp, its derivative, by
               a standard relation involving also p2, the polynomial of one lower order. */
            pp=(n*(alf-bet-temp*z)*p1+2.0*(n+alf)*(n+bet)*p2)/(temp*(1.0-z*z));
            z1=z;
            z=z1-p1/pp;             /* Newton's formula. */
            if (fabs(z-z1) <= EPS) break;
        }
        if (its > MAXIT) nrerror("too many iterations in gaujac");
        x[i]=z;                     /* Store the root and the weight. */
        w[i]=exp(gammln(alf+n)+gammln(bet+n)-gammln(n+1.0)-
            gammln(n+alfbet+1.0))*temp*pow(2.0,alfbet)/(pp*p2);
    }
}


Legendre polynomials are special cases of Jacobi polynomials with $\alpha = \beta = 0$,
but it is worth having the separate routine for them, gauleg, given above. Chebyshev
polynomials correspond to $\alpha = \beta = -1/2$ (see §5.8). They have analytic abscissas
and weights:

$$x_j = \cos\left(\frac{\pi (j - \frac{1}{2})}{N}\right) \qquad w_j = \frac{\pi}{N} \qquad (4.5.24)$$


Case of Known Recurrences 


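Turn now to the case where you do not know good initial guesses for the zeros of your
orthogonal polynomials, but you do have available the coefficients $a_j$ and $b_j$ that generate
them. As we have seen, the zeros of $p_N(x)$ are the abscissas for the N-point Gaussian
quadrature formula. The most useful computational formula for the weights is equation
(4.5.9) above, since the derivative $p'_N$ can be efficiently computed by the derivative of (4.5.6)
in the general case, or by special relations for the classical polynomials. Note that (4.5.9) is
valid as written only for monic polynomials; for other normalizations, there is an extra factor
of $\lambda_N/\lambda_{N-1}$, where $\lambda_N$ is the coefficient of $x^N$ in $p_N$.

Except in those special cases already discussed, the best way to find the abscissas is not
to use a root-finding method like Newton’s method on $p_N(x)$. Rather, it is generally faster
to use the Golub-Welsch [3] algorithm, which is based on a result of Wilf [4]. This algorithm
notes that if you bring the term $x p_j$ to the left-hand side of (4.5.6) and the term $p_{j+1}$ to the
right-hand side, the recurrence relation can be written in matrix form as

$$x \begin{bmatrix} p_0 \\ p_1 \\ \vdots \\ p_{N-2} \\ p_{N-1} \end{bmatrix} = \begin{bmatrix} a_0 & 1 & & & \\ b_1 & a_1 & 1 & & \\ & \ddots & \ddots & \ddots & \\ & & b_{N-2} & a_{N-2} & 1 \\ & & & b_{N-1} & a_{N-1} \end{bmatrix} \begin{bmatrix} p_0 \\ p_1 \\ \vdots \\ p_{N-2} \\ p_{N-1} \end{bmatrix} + \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ p_N \end{bmatrix}$$

or

$$x\,\mathbf{p} = \mathbf{T} \cdot \mathbf{p} + p_N\,\mathbf{e}_{N-1} \qquad (4.5.25)$$

Here T is a tridiagonal matrix, p is a column vector of $p_0, p_1, \ldots, p_{N-1}$, and $\mathbf{e}_{N-1}$ is a unit
vector with a 1 in the (N - 1)st (last) position and zeros elsewhere. The matrix T can be
symmetrized by a diagonal similarity transformation D to give

$$\mathbf{J} = \mathbf{D}\mathbf{T}\mathbf{D}^{-1} = \begin{bmatrix} a_0 & \sqrt{b_1} & & & \\ \sqrt{b_1} & a_1 & \sqrt{b_2} & & \\ & \ddots & \ddots & \ddots & \\ & & \sqrt{b_{N-2}} & a_{N-2} & \sqrt{b_{N-1}} \\ & & & \sqrt{b_{N-1}} & a_{N-1} \end{bmatrix} \qquad (4.5.26)$$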

The matrix J is called the Jacobi matrix (not to be confused with other matrices named
after Jacobi that arise in completely different problems!). Now we see from (4.5.25) that
$p_N(x_j) = 0$ is equivalent to $x_j$ being an eigenvalue of T. Since eigenvalues are preserved
by a similarity transformation, $x_j$ is an eigenvalue of the symmetric tridiagonal matrix J.
Moreover, Wilf [4] shows that if $\mathbf{v}_j$ is the eigenvector corresponding to the eigenvalue $x_j$,
normalized so that $\mathbf{v} \cdot \mathbf{v} = 1$, then

$$w_j = \mu_0\, v_{j,1}^2 \qquad (4.5.27)$$

where

$$\mu_0 = \int_a^b W(x)\,dx \qquad (4.5.28)$$

and where $v_{j,1}$ is the first component of v. As we shall see in Chapter 11, finding all
eigenvalues and eigenvectors of a symmetric tridiagonal matrix is a relatively efficient and
well-conditioned procedure. We accordingly give a routine, gaucof, for finding the abscissas
and weights, given the coefficients $a_j$ and $b_j$. Remember that if you know the recurrence
relation for orthogonal polynomials that are not normalized to be monic, you can easily convert
it to monic form by means of the quantities $\lambda_j$.

#include <math.h>
#include "nrutil.h"

void gaucof(int n, float a[], float b[], float amu0, float x[], float w[])
/* Computes the abscissas and weights for a Gaussian quadrature formula from the Jacobi matrix.
On input, a[1..n] and b[1..n] are the coefficients of the recurrence relation for the set of
monic orthogonal polynomials. The quantity mu0, the integral of W(x) over (a,b), is input as
amu0. The abscissas x[1..n] are returned in descending order, with the corresponding weights
in w[1..n]. The arrays a and b are modified. Execution can be speeded up by modifying tqli
and eigsrt to compute only the first component of each eigenvector. */
{
    void eigsrt(float d[], float **v, int n);
    void tqli(float d[], float e[], int n, float **z);
    int i,j;
    float **z;

    z=matrix(1,n,1,n);
    for (i=1;i<=n;i++) {
        if (i != 1) b[i]=sqrt(b[i]);    /* Set up superdiagonal of Jacobi matrix. */
        for (j=1;j<=n;j++) z[i][j]=(float)(i == j);
        /* Set up identity matrix for tqli to compute eigenvectors. */
    }
    tqli(a,b,n,z);
    eigsrt(a,z,n);                      /* Sort eigenvalues into descending order. */
    for (i=1;i<=n;i++) {
        x[i]=a[i];
        w[i]=amu0*z[1][i]*z[1][i];      /* Equation (4.5.27). */
    }
    free_matrix(z,1,n,1,n);
}


Orthogonal Polynomials with Nonclassical Weights 

This somewhat specialized subsection will tell you what to do if your weight function
is not one of the classical ones dealt with above and you do not know the $a_j$’s and $b_j$’s
of the recurrence relation (4.5.6) to use in gaucof. Then, a method of finding the $a_j$’s
and $b_j$’s is needed.

The procedure of Stieltjes is to compute $a_0$ from (4.5.7), then $p_1(x)$ from (4.5.6).
Knowing $p_0$ and $p_1$, we can compute $a_1$ and $b_1$ from (4.5.7), and so on. But how are we
to compute the inner products in (4.5.7)?







The textbook approach is to represent each $p_j(x)$ explicitly as a polynomial in x and
to compute the inner products by multiplying out term by term. This will be feasible if we
know the first 2N moments of the weight function,

$$\mu_j = \int_a^b x^j W(x)\,dx \qquad j = 0, 1, \ldots, 2N-1 \qquad (4.5.29)$$

However, the solution of the resulting set of algebraic equations for the coefficients $a_j$ and $b_j$
in terms of the moments $\mu_j$ is in general extremely ill-conditioned. Even in double precision,
it is not unusual to lose all accuracy by the time N = 12. We thus reject any procedure
based on the moments (4.5.29).

Sack and Donovan [5] discovered that the numerical stability is greatly improved if,
instead of using powers of x as a set of basis functions to represent the $p_j$’s, one uses some
other known set of orthogonal polynomials $\pi_j(x)$, say. Roughly speaking, the improved
stability occurs because the polynomial basis “samples” the interval (a, b) better than the
power basis when the inner product integrals are evaluated, especially if its weight function
resembles W(x).

So assume that we know the modified moments 

Vj = J 7 tj{x)W(x)dx j = 0,1,..., 21V — 1 (4.5.30) 

where the 7r/s satisfy a recurrence relation analogous to (4.5.6), 

7r_i(a;) = 0 


Ti-o (x) s 1 (4.5.31) 

” , • i (a’) - (a: - - /3jirj-i(x) j = 0,1,2,... 


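and the coefficients $\alpha_j$, $\beta_j$ are known explicitly. Then Wheeler [6] has given an efficient
$O(N^2)$ algorithm equivalent to that of Sack and Donovan for finding $a_j$ and $b_j$ via a set
of intermediate quantities

$$\sigma_{k,l} = \langle p_k | \pi_l \rangle \qquad k,l \ge -1 \tag{4.5.32}$$

Initialize

$$\begin{aligned}
\sigma_{-1,l} &= 0 & l &= 1,2,\ldots,2N-2 \\
\sigma_{0,l} &= \nu_l & l &= 0,1,\ldots,2N-1 \\
a_0 &= \alpha_0 + \frac{\nu_1}{\nu_0} \\
b_0 &= 0
\end{aligned} \tag{4.5.33}$$

Then, for $k = 1,2,\ldots,N-1$, compute

$$\begin{aligned}
\sigma_{k,l} &= \sigma_{k-1,l+1} - (a_{k-1} - \alpha_l)\,\sigma_{k-1,l} - b_{k-1}\,\sigma_{k-2,l} + \beta_l\,\sigma_{k-1,l-1} \qquad l = k,k+1,\ldots,2N-k-1 \\
a_k &= \alpha_k - \frac{\sigma_{k-1,k}}{\sigma_{k-1,k-1}} + \frac{\sigma_{k,k+1}}{\sigma_{k,k}} \\
b_k &= \frac{\sigma_{k,k}}{\sigma_{k-1,k-1}}
\end{aligned} \tag{4.5.34}$$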


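Note that the normalization factors can also easily be computed if needed:

$$\begin{aligned}
\langle p_0 | p_0 \rangle &= \nu_0 \\
\langle p_j | p_j \rangle &= b_j\,\langle p_{j-1} | p_{j-1} \rangle \qquad j = 1,2,\ldots
\end{aligned} \tag{4.5.35}$$

You can find a derivation of the above algorithm in Ref. [7].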

Wheeler’s algorithm requires that the modified moments (4.5.30) be accurately computed. 
In practical cases there is often a closed form, or else recurrence relations can be used. The 
algorithm is extremely successful for finite intervals (a, b). For infinite intervals, the algorithm
does not completely remove the ill-conditioning. In this case, Gautschi [8,9] recommends 
reducing the interval to a finite interval by a change of variable, and then using a suitable 
discretization procedure to compute the inner products. You will have to consult the 
references for details. 

We give the routine orthog for generating the coefficients $a_j$ and $b_j$ by Wheeler’s
algorithm, given the coefficients $\alpha_j$ and $\beta_j$, and the modified moments $\nu_j$. For consistency
with gaucof, the vectors $\alpha$, $\beta$, $a$ and $b$ are 1-based. Correspondingly, we increase the indices
of the $\sigma$ matrix by 2, i.e., sig[k,l] = $\sigma_{k-2,l-2}$.

#include "nrutil.h"

void orthog(int n, float anu[], float alpha[], float beta[], float a[],
    float b[])
/* Computes the coefficients a_j and b_j, j = 0,...,N-1, of the recurrence relation for monic
orthogonal polynomials with weight function W(x) by Wheeler's algorithm. On input, the arrays
alpha[1..2*n-1] and beta[1..2*n-1] are the coefficients alpha_j and beta_j, j = 0,...,2N-2, of
the recurrence relation for the chosen basis of orthogonal polynomials. The modified moments
nu_j are input in anu[1..2*n]. The first n coefficients are returned in a[1..n] and b[1..n]. */
{
    int k,l;
    float **sig;
    int looptmp;

    sig=matrix(1,2*n+1,1,2*n+1);
    looptmp=2*n;
    for (l=3;l<=looptmp;l++) sig[1][l]=0.0;    /* Initialization, Equation (4.5.33). */
    looptmp++;
    for (l=2;l<=looptmp;l++) sig[2][l]=anu[l-1];
    a[1]=alpha[1]+anu[2]/anu[1];
    b[1]=0.0;
    for (k=3;k<=n+1;k++) {                     /* Equation (4.5.34). */
        looptmp=2*n-k+3;
        for (l=k;l<=looptmp;l++) {
            sig[k][l]=sig[k-1][l+1]+(alpha[l-1]-a[k-2])*sig[k-1][l]-
                b[k-2]*sig[k-2][l]+beta[l-1]*sig[k-1][l-1];
        }
        a[k-1]=alpha[k-1]+sig[k][k+1]/sig[k][k]-sig[k-1][k]/sig[k-1][k-1];
        b[k-1]=sig[k][k]/sig[k-1][k-1];
    }
    free_matrix(sig,1,2*n+1,1,2*n+1);
}


As an example of the use of orthog, consider the problem [7] of generating orthogonal
polynomials with the weight function $W(x) = -\log x$ on the interval (0,1). A suitable set
of $\pi_j$'s is the shifted Legendre polynomials

$$\pi_j = \frac{(j!)^2}{(2j)!}\,P_j(2x-1) \tag{4.5.36}$$

The factor in front of $P_j$ makes the polynomials monic. The coefficients in the recurrence
relation (4.5.31) are

$$\alpha_j = \frac{1}{2} \quad j = 0,1,\ldots \qquad\qquad \beta_j = \frac{1}{4(4 - j^{-2})} \quad j = 1,2,\ldots \tag{4.5.37}$$

while the modified moments are

$$\nu_j = \begin{cases} 1 & j = 0 \\[4pt] \dfrac{(-1)^j\,(j!)^2}{j(j+1)(2j)!} & j \ge 1 \end{cases} \tag{4.5.38}$$

A call to orthog with this input allows one to generate the required polynomials to machine
accuracy for very large N, and hence do Gaussian quadrature with this weight function. Before
Sack and Donovan’s observation, this seemingly simple problem was essentially intractable.
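
The recurrences (4.5.33) and (4.5.34) and the inputs (4.5.37) and (4.5.38) can be tried out together in a few lines. The following is only a sketch under stated assumptions, not the book's routine: the names wheeler and logweight_inputs are hypothetical, the arrays are 0-based, the arithmetic is double precision, and three rolling rows stand in for the full sig matrix.

```c
#include <stdlib.h>

/* Sketch of Wheeler's algorithm, equations (4.5.33)-(4.5.34), 0-based.
   nu[0..2n-1]: modified moments; alpha[0..2n-2], beta[0..2n-2]: recurrence
   coefficients of the chosen basis (beta[0] is never used).
   Output: a[0..n-1], b[0..n-1].  Returns 0, or -1 if allocation fails. */
int wheeler(int n, const double nu[], const double alpha[],
            const double beta[], double a[], double b[])
{
    int k, l, m = 2 * n;
    double *prev2, *prev1, *cur, *tmp;
    double *work = calloc((size_t)(3 * m), sizeof *work);

    if (work == NULL) return -1;
    prev2 = work;            /* sigma_{k-2,l}; zeros represent sigma_{-1,l} */
    prev1 = work + m;        /* sigma_{k-1,l} */
    cur = work + 2 * m;      /* sigma_{k,l}, the row being built */

    for (l = 0; l < m; l++) prev1[l] = nu[l];    /* sigma_{0,l} = nu_l */
    a[0] = alpha[0] + nu[1] / nu[0];
    b[0] = 0.0;

    for (k = 1; k < n; k++) {
        for (l = k; l <= m - k - 1; l++)
            cur[l] = prev1[l + 1] - (a[k - 1] - alpha[l]) * prev1[l]
                     - b[k - 1] * prev2[l] + beta[l] * prev1[l - 1];
        a[k] = alpha[k] - prev1[k] / prev1[k - 1] + cur[k + 1] / cur[k];
        b[k] = cur[k] / prev1[k - 1];
        tmp = prev2; prev2 = prev1; prev1 = cur; cur = tmp;  /* rotate rows */
    }
    free(work);
    return 0;
}

/* Inputs for W(x) = -log x on (0,1), equations (4.5.37)-(4.5.38). */
void logweight_inputs(int n, double nu[], double alpha[], double beta[])
{
    int j;
    double r = 1.0;          /* r = (j!)^2/(2j)!, updated from j-1 to j */

    nu[0] = 1.0;
    for (j = 1; j < 2 * n; j++) {
        r *= (double)j / (2.0 * (2.0 * j - 1.0));
        nu[j] = ((j % 2) ? -1.0 : 1.0) * r / (j * (j + 1.0));
    }
    for (j = 0; j < 2 * n - 1; j++) {
        alpha[j] = 0.5;
        beta[j] = (j == 0) ? 0.0 : 1.0 / (4.0 * (4.0 - 1.0 / ((double)j * j)));
    }
}
```

As a hand check, working the integrals directly for this weight gives $a_0 = 1/4$, $b_1 = 7/144$ and $a_1 = 13/28$, which is what the recurrence should reproduce.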

Extensions of Gaussian Quadrature 

There are many different ways in which the ideas of Gaussian quadrature have 
been extended. One important extension is the case of preassigned nodes: Some
points are required to be included in the set of abscissas, and the problem is to choose
the weights and the remaining abscissas to maximize the degree of exactness of the
quadrature rule. The most common cases are Gauss-Radau quadrature, where
one of the nodes is an endpoint of the interval, either a or b, and Gauss-Lobatto
quadrature, where both a and b are nodes. Golub [10] has given an algorithm similar 
to gaucof for these cases. 

The second important extension is the Gauss-Kronrod formulas. For ordinary 
Gaussian quadrature formulas, as N increases the sets of abscissas have no points 
in common. This means that if you compare results with increasing N as a way of 
estimating the quadrature error, you cannot reuse the previous function evaluations. 
Kronrod [11] posed the problem of searching for optimal sequences of rules, each
of which reuses all abscissas of its predecessor. If one starts with N = m, say,
and then adds n new points, one has 2n + m free parameters: the n new abscissas
and weights, and m new weights for the fixed previous abscissas. The maximum
degree of exactness one would expect to achieve would therefore be 2n + m - 1.
The question is whether this maximum degree of exactness can actually be achieved 
in practice, when the abscissas are required to all lie inside (a, b). The answer to 
this question is not known in general. 

Kronrod showed that if you choose n = m + 1, an optimal extension can 
be found for Gauss-Legendre quadrature. Patterson [12] showed how to compute 
continued extensions of this kind. Sequences such as N = 10,21,43,87,... are 
popular in automatic quadrature routines [13] that attempt to integrate a function until 
some specified accuracy has been achieved. 


CITED REFERENCES AND FURTHER READING: 

Abramowitz, M., and Stegun, I.A. 1964, Handbook of Mathematical Functions, Applied Mathematics
Series, Volume 55 (Washington: National Bureau of Standards; reprinted 1968 by
Dover Publications, New York), §25.4. [1]

Stroud, A.H., and Secrest, D. 1966, Gaussian Quadrature Formulas (Englewood Cliffs, NJ: 
Prentice-Hall). [2] 

Golub, G.H., and Welsch, J.H. 1969, Mathematics of Computation, vol. 23, pp. 221-230 and 
A1-A10. [3] 

Wilf, H.S. 1962, Mathematics for the Physical Sciences (New York: Wiley), Problem 9, p. 80. [4] 
Sack, R.A., and Donovan, A.F. 1971/72, Numerische Mathematik, vol. 18, pp. 465-478. [5] 
Wheeler, J.C. 1974, Rocky Mountain Journal of Mathematics, vol. 4, pp. 287-296. [6] 

Gautschi, W. 1978, in Recent Advances in Numerical Analysis, C. de Boor and G.H. Golub, eds. 
(New York: Academic Press), pp. 45-72. [7] 

Gautschi, W. 1981, in E.B. Christoffel, P.L. Butzer and F. Feher, eds. (Basel: Birkhauser Verlag),
pp. 72-147. [8]

Gautschi, W. 1990, in Orthogonal Polynomials, P. Nevai, ed. (Dordrecht: Kluwer Academic
Publishers), pp. 181-216. [9]

Golub, G.H. 1973, SIAM Review, vol. 15, pp. 318-334. [10]

Kronrod, A.S. 1964, Doklady Akademii Nauk SSSR, vol. 154, pp. 283-286 (in Russian). [11] 
Patterson, T.N.L. 1968, Mathematics of Computation, vol. 22, pp. 847-856 and C1-C11; 1969,
op. cit., vol. 23, p. 892. [12]

Piessens, R., de Doncker, E., Uberhuber, C.W., and Kahaner, D.K. 1983, QUADPACK: A Subroutine
Package for Automatic Integration (New York: Springer-Verlag). [13]

Stoer, J., and Bulirsch, R. 1980, Introduction to Numerical Analysis (New York: Springer-Verlag), 
§3.6. 

Johnson, L.W., and Riess, R.D. 1982, Numerical Analysis, 2nd ed. (Reading, MA: Addison-Wesley),
§6.5.

Carnahan, B., Luther, H.A., and Wilkes, J.O. 1969, Applied Numerical Methods (New York: 
Wiley), §§2.9-2.10. 

Ralston, A., and Rabinowitz, P. 1978, A First Course in Numerical Analysis, 2nd ed. (New York: 
McGraw-Hill), §§4.4-4.8. 



4.6 Multidimensional Integrals 

Integrals of functions of several variables, over regions with dimension greater
than one, are not easy. There are two reasons for this. First, the number of function
evaluations needed to sample an N-dimensional space increases as the Nth power
of the number needed to do a one-dimensional integral. If you need 30 function
evaluations to do a one-dimensional integral crudely, then you will likely need on
the order of 30000 evaluations to reach the same crude level for a three-dimensional
integral. Second, the region of integration in N-dimensional space is defined by
an N - 1 dimensional boundary which can itself be terribly complicated: It need
not be convex or simply connected, for example. By contrast, the boundary of a
one-dimensional integral consists of two numbers, its upper and lower limits.

The first question to be asked, when faced with a multidimensional integral, 
is, “can it be reduced analytically to a lower dimensionality?” For example, 
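so-called iterated integrals of a function of one variable $f(t)$ can be reduced to
one-dimensional integrals by the formula

$$\int_0^x dt_n \int_0^{t_n} dt_{n-1} \cdots \int_0^{t_3} dt_2 \int_0^{t_2} f(t_1)\,dt_1 = \frac{1}{(n-1)!} \int_0^x (x-t)^{n-1} f(t)\,dt \tag{4.6.1}$$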

Alternatively, the function may have some special symmetry in the way it depends 
on its independent variables. If the boundary also has this symmetry, then the 
dimension can be reduced. In three dimensions, for example, the integration of a 
spherically symmetric function over a spherical region reduces, in polar coordinates, 
to a one-dimensional integral. 

The next questions to be asked will guide your choice between two entirely 
different approaches to doing the problem. The questions are: Is the shape of the 
boundary of the region of integration simple or complicated? Inside the region, is 
the integrand smooth and simple, or complicated, or locally strongly peaked? Does 
the problem require high accuracy, or does it require an answer accurate only to
a percent, or a few percent? 

If your answers are that the boundary is complicated, the integrand is not 
strongly peaked in very small regions, and relatively low accuracy is tolerable, then 
your problem is a good candidate for Monte Carlo integration. This method is very 
straightforward to program, in its cruder forms. One needs only to know a region 
with simple boundaries that includes the complicated region of integration, plus a 
method of determining whether a random point is inside or outside the region of 
integration. Monte Carlo integration evaluates the function at a random sample of 
points, and estimates its integral based on that random sample. We will discuss it in 
more detail, and with more sophistication, in Chapter 7. 
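
In its crudest form the idea fits in a few lines. The sketch below is an illustration under stated assumptions rather than a recommended routine: the region is taken to be the unit disk inside the square [-1,1] x [-1,1], the names mc_disk and lcg_uniform are hypothetical, and the small linear congruential generator is there only to keep the example self-contained.

```c
#include <stdint.h>

/* Hypothetical helper: a 64-bit linear congruential generator returning a
   uniform deviate in [0,1).  It is a stand-in, not one of the generators
   treated in Chapter 7. */
static double lcg_uniform(uint64_t *state)
{
    *state = *state * 6364136223846793005ULL + 1442695040888963407ULL;
    return (double)(*state >> 11) / 9007199254740992.0;  /* top 53 bits / 2^53 */
}

/* Estimate the integral of f over the unit disk: sample the enclosing
   square uniformly and accumulate f only at points inside the disk. */
double mc_disk(double (*f)(double, double), long npts, uint64_t seed)
{
    uint64_t state = seed;
    double sum = 0.0;
    long i;

    for (i = 0; i < npts; i++) {
        double x = 2.0 * lcg_uniform(&state) - 1.0;
        double y = 2.0 * lcg_uniform(&state) - 1.0;
        if (x * x + y * y <= 1.0) sum += f(x, y);   /* inside-or-outside test */
    }
    return 4.0 * sum / (double)npts;   /* area of the square times the mean */
}

/* Integrand equal to 1 everywhere, so the integral is the area of the disk. */
double one(double x, double y) { (void)x; (void)y; return 1.0; }
```

With the integrand one, the estimate should approach the area $\pi$ of the disk, with a statistical error that shrinks only as the inverse square root of the number of points.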

If the boundary is simple, and the function is very smooth, then the remaining 
approaches, breaking up the problem into repeated one-dimensional integrals, or 
multidimensional Gaussian quadratures, will be effective and relatively fast [1]. If 
you require high accuracy, these approaches are in any case the only ones available 
to you, since Monte Carlo methods are by nature asymptotically slow to converge. 

For low accuracy, use repeated one-dimensional integration or multidimensional 
Gaussian quadratures when the integrand is slowly varying and smooth in the region 
of integration, Monte Carlo when the integrand is oscillatory or discontinuous, but 
not strongly peaked in small regions. 

If the integrand is strongly peaked in small regions, and you know where those 
regions are, break the integral up into several regions so that the integrand is smooth 
in each, and do each separately. If you don’t know where the strongly peaked regions 
are, you might as well (at the level of sophistication of this book) quit: It is hopeless 
to expect an integration routine to search out unknown pockets of large contribution 
in a huge N-dimensional space. (But see §7.8.)


If, on the basis of the above guidelines, you decide to pursue the repeated one-dimensional
integration approach, here is how it works. For definiteness, we will
consider the case of a three-dimensional integral in x, y, z-space. Two dimensions,
or more than three dimensions, are entirely analogous.

The first step is to specify the region of integration by (i) its lower and upper
limits in x, which we will denote $x_1$ and $x_2$; (ii) its lower and upper limits in y at
a specified value of x, denoted $y_1(x)$ and $y_2(x)$; and (iii) its lower and upper limits
in z at specified x and y, denoted $z_1(x,y)$ and $z_2(x,y)$. In other words, find the
numbers $x_1$ and $x_2$, and the functions $y_1(x)$, $y_2(x)$, $z_1(x,y)$, and $z_2(x,y)$ such that

$$I \equiv \int\!\!\int\!\!\int dx\,dy\,dz\,f(x,y,z) = \int_{x_1}^{x_2} dx \int_{y_1(x)}^{y_2(x)} dy \int_{z_1(x,y)}^{z_2(x,y)} dz\,f(x,y,z) \tag{4.6.2}$$


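For example, a two-dimensional integral over a circle of radius one centered on
the origin becomes

$$\int_{-1}^{1} dx \int_{-\sqrt{1-x^2}}^{\sqrt{1-x^2}} dy\; f(x,y) \tag{4.6.3}$$

Now we can define a function $G(x,y)$ that does the innermost integral,

$$G(x,y) \equiv \int_{z_1(x,y)}^{z_2(x,y)} f(x,y,z)\,dz \tag{4.6.4}$$

and a function $H(x)$ that does the integral of $G(x,y)$,

$$H(x) \equiv \int_{y_1(x)}^{y_2(x)} G(x,y)\,dy \tag{4.6.5}$$

and finally our answer as an integral over $H(x)$

$$I = \int_{x_1}^{x_2} H(x)\,dx \tag{4.6.6}$$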

In an implementation of equations (4.6.4)-(4.6.6), some basic one-dimensional 
integration routine (e.g., qgaus in the program following) gets called recursively: 
once to evaluate the outer integral I, then many times to evaluate the middle integral
H, then even more times to evaluate the inner integral G (see Figure 4.6.1). Current
values of x and y, and the pointer to your function func, are passed “over the head”
of the intermediate calls through static top-level variables.
static float xsav,ysav;
static float (*nrfunc)(float,float,float);

float quad3d(float (*func)(float, float, float), float x1, float x2)
/* Returns the integral of a user-supplied function func over a three-dimensional region specified
by the limits x1, x2, and by the user-supplied functions yy1, yy2, z1, and z2, as defined in
(4.6.2). (The functions y1 and y2 are here called yy1 and yy2 to avoid conflict with the names
of Bessel functions in some C libraries.) Integration is performed by calling qgaus recursively. */
{
    float qgaus(float (*func)(float), float a, float b);
    float f1(float x);

    nrfunc=func;
    return qgaus(f1,x1,x2);
}

float f1(float x)    /* This is H of eq. (4.6.5). */
{
    float qgaus(float (*func)(float), float a, float b);
    float f2(float y);
    float yy1(float),yy2(float);

    xsav=x;
    return qgaus(f2,yy1(x),yy2(x));
}

float f2(float y)    /* This is G of eq. (4.6.4). */
{
    float qgaus(float (*func)(float), float a, float b);
    float f3(float z);
    float z1(float,float),z2(float,float);

    ysav=y;
    return qgaus(f3,z1(xsav,y),z2(xsav,y));
}

float f3(float z)    /* The integrand f(x,y,z) evaluated at fixed x and y. */
{
    return (*nrfunc)(xsav,ysav,z);
}


The necessary user-supplied functions have the following prototypes:

float func(float x,float y,float z);    /* The 3-dimensional function to be integrated. */
float yy1(float x);
float yy2(float x);
float z1(float x,float y);
float z2(float x,float y);


CITED REFERENCES AND FURTHER READING: 

Stroud, A.H. 1971, Approximate Calculation of Multiple Integrals (Englewood Cliffs, NJ: Prentice- 
Hall). [1] 

Dahlquist, G., and Bjorck, A. 1974, Numerical Methods (Englewood Cliffs, NJ: Prentice-Hall), 
§7.7, p. 318. 

Johnson, L.W., and Riess, R.D. 1982, Numerical Analysis, 2nd ed. (Reading, MA: Addison- 
Wesley), §6.2.5, p. 307. 

Abramowitz, M., and Stegun, I.A. 1964, Handbook of Mathematical Functions, Applied Mathematics
Series, Volume 55 (Washington: National Bureau of Standards; reprinted 1968 by
Dover Publications, New York), equations 25.4.58ff.


Chapter 5. Evaluation of Functions

5.0 Introduction 


The purpose of this chapter is to acquaint you with a selection of the techniques 
that are frequently used in evaluating functions. In Chapter 6, we will apply and 
illustrate these techniques by giving routines for a variety of specific functions. 
The purposes of this chapter and the next are thus mostly in harmony, but there 
is nevertheless some tension between them: Routines that are clearest and most 
illustrative of the general techniques of this chapter are not always the methods of 
choice for a particular special function. By comparing this chapter to the next one, 
you should get some idea of the balance between “general” and “special” methods 
that occurs in practice. 

Insofar as that balance favors general methods, this chapter should give you 
ideas about how to write your own routine for the evaluation of a function which, 
while “special” to you, is not so special as to be included in Chapter 6 or the 
standard program libraries. 


CITED REFERENCES AND FURTHER READING: 

Fike, C.T. 1968, Computer Evaluation of Mathematical Functions (Englewood Cliffs, NJ: Prentice- 
Hall). 

Lanczos, C. 1956, Applied Analysis; reprinted 1988 (New York: Dover), Chapter 7. 


5.1 Series and Their Convergence 

Everybody knows that an analytic function can be expanded in the neighborhood
of a point $x_0$ in a power series,

$$f(x) = \sum_{k=0}^{\infty} a_k (x - x_0)^k \tag{5.1.1}$$

Such series are straightforward to evaluate. You don’t, of course, evaluate the kth
power of $x - x_0$ ab initio for each term; rather you keep the (k-1)st power and update
it with a multiply. Similarly, the form of the coefficients $a_k$ is often such as to make
use of previous work: Terms like $k!$ or $(2k)!$ can be updated in a multiply or two.
How do you know when you have summed enough terms? In practice, the
terms had better be getting small fast, otherwise the series is not a good technique
to use in the first place. While not mathematically rigorous in all cases, standard
practice is to quit when the term you have just added is smaller in magnitude than
some small $\epsilon$ times the magnitude of the sum thus far accumulated. (But watch out
if isolated instances of $a_k = 0$ are possible!).
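
Both habits, updating each term from its predecessor and stopping on a relative-size test, can be sketched as follows. The example sums the familiar sine series; the function name sin_series and its eps and maxterms parameters are assumptions made for illustration.

```c
#include <math.h>

/* Sum x - x^3/3! + x^5/5! - ... by updating each term from the previous
   one, quitting when the term just added is negligible relative to the sum. */
double sin_series(double x, double eps, int maxterms)
{
    double term = x;    /* the k = 0 term */
    double sum = x;
    int k;

    for (k = 1; k < maxterms; k++) {
        /* term_k = term_{k-1} * (-x^2) / ((2k)(2k+1)); no powers or
           factorials are recomputed from scratch */
        term *= -x * x / ((2.0 * k) * (2.0 * k + 1.0));
        sum += term;
        if (fabs(term) <= eps * fabs(sum)) break;
    }
    return sum;
}
```

The maxterms cap is a safeguard for arguments where the terms are not yet decreasing; for such arguments the series is the wrong tool anyway.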

A weakness of a power series representation is that it is guaranteed not to
converge farther than that distance from $x_0$ at which a singularity is encountered
in the complex plane. This catastrophe is not usually unexpected: When you find
a power series in a book (or when you work one out yourself), you will generally
also know the radius of convergence. An insidious problem occurs with series that
converge everywhere (in the mathematical sense), but almost nowhere fast enough
to be useful in a numerical method. Two familiar examples are the sine function
and the Bessel function of the first kind,

$$\sin x = \sum_{k=0}^{\infty} \frac{(-1)^k}{(2k+1)!}\,x^{2k+1} \tag{5.1.2}$$

$$J_n(x) = \left(\frac{x}{2}\right)^n \sum_{k=0}^{\infty} \frac{(-\tfrac{1}{4}x^2)^k}{k!\,(k+n)!} \tag{5.1.3}$$

Both of these series converge for all x. But both don’t even start to converge
until $k \gg |x|$; before this, their terms are increasing. This makes these series
useless for large x.

Accelerating the Convergence of Series 


There are several tricks for accelerating the rate of convergence of a series (or,
equivalently, of a sequence of partial sums). These tricks will not generally help in
cases like (5.1.2) or (5.1.3) while the size of the terms is still increasing. For series
with terms of decreasing magnitude, however, some accelerating methods can be
startlingly good. Aitken’s $\delta^2$-process is simply a formula for extrapolating the partial
sums of a series whose convergence is approximately geometric. If $S_{n-1}$, $S_n$, $S_{n+1}$
are three successive partial sums, then an improved estimate is

$$S'_n \equiv S_{n+1} - \frac{(S_{n+1} - S_n)^2}{S_{n+1} - 2S_n + S_{n-1}} \tag{5.1.4}$$

You can also use (5.1.4) with $n + 1$ and $n - 1$ replaced by $n + p$ and $n - p$,
respectively, for any integer $p$. If you form the sequence of $S'_i$'s, you can apply
(5.1.4) a second time to that sequence, and so on. (In practice, this iteration will
only rarely do much for you after the first stage.) Note that equation (5.1.4) should
be computed as written; there exist algebraically equivalent forms that are much
more susceptible to roundoff error.
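
Written out in code, equation (5.1.4) is a one-liner plus a guard. The sketch below uses a hypothetical function name, aitken, and the zero-denominator guard is an added assumption, not part of the formula.

```c
/* Aitken's delta-squared extrapolation from three successive partial sums,
   computed in the form of equation (5.1.4).
   sm = S_{n-1}, s0 = S_n, sp = S_{n+1}. */
double aitken(double sm, double s0, double sp)
{
    double d = sp - s0;
    double denom = sp - 2.0 * s0 + sm;

    if (denom == 0.0) return sp;   /* assumed guard: no curvature to extrapolate */
    return sp - d * d / denom;
}
```

For an exactly geometric series the extrapolation is exact: the partial sums 1, 1.5, 1.75 of $\sum 2^{-k}$ give $1.75 - (0.25)^2/(-0.25) = 2$, the limit of the series.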


For alternating series (where the terms in the sum alternate in sign), Euler’s
transformation can be a powerful tool. Generally it is advisable to do a small
number of terms directly, through term $n - 1$ say, then apply the transformation to
the rest of the series beginning with term $n$. The formula (for $n$ even) is

$$\sum_{s=0}^{\infty} (-1)^s u_s = u_0 - u_1 + u_2 \cdots - u_{n-1} + \sum_{s=0}^{\infty} \frac{(-1)^s}{2^{s+1}} \left[\Delta^s u_n\right] \tag{5.1.5}$$

Here $\Delta$ is the forward difference operator, i.e.,

$$\begin{aligned}
\Delta u_n &\equiv u_{n+1} - u_n \\
\Delta^2 u_n &\equiv u_{n+2} - 2u_{n+1} + u_n \\
\Delta^3 u_n &\equiv u_{n+3} - 3u_{n+2} + 3u_{n+1} - u_n \qquad \text{etc.}
\end{aligned} \tag{5.1.6}$$

Of course you don’t actually do the infinite sum on the right-hand side of (5.1.5),
but only the first, say, $p$ terms, thus requiring the first $p$ differences (5.1.6) obtained
from the terms starting at $u_n$.
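
A direct, unoptimized reading of (5.1.5) with $n = 0$ can be sketched as below. The function name euler_sum is hypothetical, and the difference table is built in place by brute force; this is meant only to make the formula concrete.

```c
/* Euler's transformation (5.1.5) with n = 0, truncated after p terms.
   On input u[0..p-1] holds the magnitudes u_s of the alternating series
   sum (-1)^s u_s.  The array is overwritten by forward differences. */
double euler_sum(double u[], int p)
{
    double sum = 0.0, sign = 1.0, scale = 0.5;   /* scale = 1/2^(s+1) */
    int s, j;

    for (s = 0; s < p; s++) {
        sum += sign * scale * u[0];              /* u[0] holds Delta^s u_0 */
        for (j = 0; j < p - s - 1; j++)          /* form the next differences */
            u[j] = u[j + 1] - u[j];
        sign = -sign;
        scale *= 0.5;
    }
    return sum;
}
```

For $u_s = 1/(s+1)$ the original series $1 - \frac12 + \frac13 - \cdots$ converges to $\ln 2$ only as $1/N$, while the transformed terms are $1/[2^{s+1}(s+1)]$, so about thirty of them already give roughly ten digits.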

Euler’s transformation can be applied not only to convergent series. In some 
cases it will produce accurate answers from the first terms of a series that is formally 
divergent. It is widely used in the summation of asymptotic series. In this case 
it is generally wise not to sum farther than where the terms start increasing in 
magnitude; and you should devise some independent numerical check that the results 
are meaningful. 

There is an elegant and subtle implementation of Euler’s transformation due 
to van Wijngaarden [1]: It incorporates the terms of the original alternating series
one at a time, in order. For each incorporation it either increases p by 1, equivalent 
to computing one further difference (5.1.6), or else retroactively increases n by 1, 
without having to redo all the difference calculations based on the old n value! The 
decision as to which to increase, n or p, is taken in such a way as to make the 
convergence most rapid. Van Wijngaarden’s technique requires only one vector of 
saved partial differences. Here is the algorithm: 

#include <math.h>

void eulsum(float *sum, float term, int jterm, float wksp[])
/* Incorporates into sum the jterm'th term, with value term, of an alternating series. sum is
input as the previous partial sum, and is output as the new partial sum. The first call to this
routine, with the first term in the series, should be with jterm=1. On the second call, term
should be set to the second term of the series, with sign opposite to that of the first call, and
jterm should be 2. And so on. wksp is a workspace array provided by the calling program,
dimensioned at least as large as the maximum number of terms to be incorporated. */
{
    int j;
    static int nterm;
    float tmp,dum;

    if (jterm == 1) {                  /* Initialize: */
        nterm=1;                       /* Number of saved differences in wksp. */
        *sum=0.5*(wksp[1]=term);       /* Return first estimate. */
    } else {
        tmp=wksp[1];
        wksp[1]=term;
        for (j=1;j<=nterm-1;j++) {     /* Update saved quantities by van Wijngaarden's algorithm. */
            dum=wksp[j+1];
            wksp[j+1]=0.5*(wksp[j]+tmp);
            tmp=dum;
        }
        wksp[nterm+1]=0.5*(wksp[nterm]+tmp);
        if (fabs(wksp[nterm+1]) <= fabs(wksp[nterm]))    /* Favorable to increase p, */
            *sum += (0.5*wksp[++nterm]);                 /* and the table becomes longer. */
        else                                             /* Favorable to increase n, */
            *sum += wksp[nterm+1];                       /* the table doesn't become longer. */
    }
}


The powerful Euler technique is not directly applicable to a series of positive 
terms. Occasionally it is useful to convert a series of positive terms into an alternating 
series, just so that the Euler transformation can be used! Van Wijngaarden has given 
a transformation for accomplishing this [1]: 


$$\sum_{r=1}^{\infty} v_r = \sum_{r=1}^{\infty} (-1)^{r-1} w_r \tag{5.1.7}$$

where

$$w_r \equiv v_r + 2v_{2r} + 4v_{4r} + 8v_{8r} + \cdots \tag{5.1.8}$$

Equations (5.1.7) and (5.1.8) replace a simple sum by a two-dimensional sum, each
term in (5.1.7) being itself an infinite sum (5.1.8). This may seem a strange way to
save on work! Since, however, the indices in (5.1.8) increase tremendously rapidly,
as powers of 2, it often requires only a few terms to converge (5.1.8) to extraordinary
accuracy. You do, however, need to be able to compute the $v_r$'s efficiently for
“random” values r. The standard “updating” tricks for sequential r’s, mentioned
above following equation (5.1.1), can’t be used.
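
To make the two-level sum concrete, here is a sketch for the illustrative choice $v_r = 1/r^2$ (an assumption made for the example; its sum is $\pi^2/6$). The names v, w_term and vw_sum are hypothetical, and averaging the last two partial sums is only a crude stand-in for a proper Euler transformation of the resulting alternating series.

```c
#include <math.h>

/* Example term v_r = 1/r^2; any v_r computable at arbitrary r would do. */
static double v(double r) { return 1.0 / (r * r); }

/* w_r of equation (5.1.8): v_r + 2 v_{2r} + 4 v_{4r} + ..., summed until
   the latest contribution is negligible relative to the running total. */
double w_term(int r)
{
    double sum = 0.0, pow2 = 1.0, t;
    int k;

    for (k = 0; k < 200; k++) {
        t = pow2 * v(pow2 * r);
        sum += t;
        if (fabs(t) <= 1e-17 * fabs(sum)) break;
        pow2 *= 2.0;
    }
    return sum;
}

/* Right-hand side of (5.1.7), truncated after nterms terms; returns the
   mean of the last two partial sums of the alternating series. */
double vw_sum(int nterms)
{
    double s = 0.0, sprev = 0.0, sign = 1.0;
    int r;

    for (r = 1; r <= nterms; r++) {
        sprev = s;
        s += sign * w_term(r);
        sign = -sign;
    }
    return 0.5 * (s + sprev);
}
```

For this choice of $v_r$ the inner sum is geometric, $w_r = 2/r^2$, which gives a simple check on w_term.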

Actually, Euler’s transformation is a special case of a more general transformation
of power series. Suppose that some known function $g(z)$ has the series

$$g(z) = \sum_{n=0}^{\infty} b_n z^n \tag{5.1.9}$$

and that you want to sum the new, unknown, series

$$f(z) = \sum_{n=0}^{\infty} c_n b_n z^n \tag{5.1.10}$$

Then it is not hard to show (see [2]) that equation (5.1.10) can be written as

$$f(z) = \sum_{n=0}^{\infty} \left[\Delta^{(n)} c_0\right] \frac{g^{(n)}}{n!}\,z^n \tag{5.1.11}$$

which often converges much more rapidly. Here $\Delta^{(n)} c_0$ is the nth finite-difference
operator (equation 5.1.6), with $\Delta^{(0)} c_0 \equiv c_0$, and $g^{(n)}$ is the nth derivative of $g(z)$.
The usual Euler transformation (equation 5.1.5 with $n = 0$) can be obtained, for
example, by substituting

$$g(z) = \frac{1}{1+z} = 1 - z + z^2 - z^3 + \cdots \tag{5.1.12}$$

into equation (5.1.11), and then setting $z = 1$.

Sometimes you will want to compute a function from a series representation 
even when the computation is not efficient. For example, you may be using the values 
obtained to fit the function to an approximating form that you will use subsequently 
(cf. §5.8). If you are summing very large numbers of slowly convergent terms, pay 
attention to roundoff errors! In floating-point representation it is more accurate to 
sum a list of numbers in the order starting with the smallest one, rather than starting 
with the largest one. It is even better to group terms pairwise, then in pairs of pairs, 
etc., so that all additions involve operands of comparable magnitude. 
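
The pairwise grouping is naturally expressed by recursion. The following is a minimal sketch with hypothetical names (pairwise_sum, naive_sum), in single precision so that the effect of roundoff is easy to see.

```c
#include <stddef.h>

/* Pairwise summation: split the list in half, sum each half recursively,
   then add the two partial sums, which are of comparable magnitude. */
float pairwise_sum(const float x[], size_t n)
{
    size_t half;

    if (n == 0) return 0.0f;
    if (n == 1) return x[0];
    half = n / 2;
    return pairwise_sum(x, half) + pairwise_sum(x + half, n - half);
}

/* Straightforward left-to-right accumulation, for comparison. */
float naive_sum(const float x[], size_t n)
{
    float s = 0.0f;
    size_t i;

    for (i = 0; i < n; i++) s += x[i];
    return s;
}
```

With 1024 copies of 0.1f, every addition in the pairwise version combines two equal operands and is therefore exact, whereas the running sum in naive_sum keeps adding a small number to an ever larger one and accumulates rounding error along the way.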

CITED REFERENCES AND FURTHER READING: 

Goodwin, E.T. (ed.) 1961, Modern Computing Methods, 2nd ed. (New York: Philosophical
Library), Chapter 13 [van Wijngaarden’s transformations]. [1]

Dahlquist, G., and Bjorck, A. 1974, Numerical Methods (Englewood Cliffs, NJ: Prentice-Hall), 
Chapter 3. 

Abramowitz, M., and Stegun, I.A. 1964, Handbook of Mathematical Functions, Applied Mathematics
Series, Volume 55 (Washington: National Bureau of Standards; reprinted 1968 by
Dover Publications, New York), §3.6.

Mathews, J., and Walker, R.L. 1970, Mathematical Methods of Physics, 2nd ed. (Reading, MA: 
W.A. Benjamin/Addison-Wesley), §2.3. [2] 


5.2 Evaluation of Continued Fractions 


Continued fractions are often powerful ways of evaluating functions that occur 
in scientific applications. A continued fraction looks like this: 


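$$f(x) = b_0 + \cfrac{a_1}{b_1 + \cfrac{a_2}{b_2 + \cfrac{a_3}{b_3 + \cfrac{a_4}{b_4 + \cfrac{a_5}{b_5 + \cdots}}}}} \tag{5.2.1}$$

Printers prefer to write this as

$$f(x) = b_0 + \frac{a_1}{b_1 +}\,\frac{a_2}{b_2 +}\,\frac{a_3}{b_3 +}\,\frac{a_4}{b_4 +}\,\frac{a_5}{b_5 +}\cdots \tag{5.2.2}$$

In either (5.2.1) or (5.2.2), the a’s and b’s can themselves be functions of x, usually
linear or quadratic monomials at worst (i.e., constants times $x$ or times $x^2$). For
example, the continued fraction representation of the tangent function is

$$\tan x = \frac{x}{1 -}\,\frac{x^2}{3 -}\,\frac{x^2}{5 -}\,\frac{x^2}{7 -}\cdots \tag{5.2.3}$$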

Continued fractions frequently converge much more rapidly than power series
expansions, and in a much larger domain in the complex plane (not necessarily
including the domain of convergence of the series, however). Sometimes the
continued fraction converges best where the series does worst, although this is not
a general rule. Blanch [1] gives a good review of the most useful convergence tests
for continued fractions.

There are standard techniques, including the important quotient-difference algorithm,
for going back and forth between continued fraction approximations, power
series approximations, and rational function approximations. Consult Acton [2] for 
an introduction to this subject, and Fike [3] for further details and references. 

How do you tell how far to go when evaluating a continued fraction? Unlike 
a series, you can’t just evaluate equation (5.2.1) from left to right, stopping when 
the change is small. Written in the form of (5.2.1), the only way to evaluate the 
continued fraction is from right to left, first (blindly!) guessing how far out to 
start. This is not the right way. 

The right way is to use a result that relates continued fractions to rational
approximations, and that gives a means of evaluating (5.2.1) or (5.2.2) from left
to right. Let $f_n$ denote the result of evaluating (5.2.2) with coefficients through
$a_n$ and $b_n$. Then

$$f_n = \frac{A_n}{B_n} \tag{5.2.4}$$

where $A_n$ and $B_n$ are given by the following recurrence:

$$\begin{aligned}
A_{-1} &\equiv 1 & B_{-1} &\equiv 0 \\
A_0 &\equiv b_0 & B_0 &\equiv 1 \\
A_j &= b_j A_{j-1} + a_j A_{j-2} & B_j &= b_j B_{j-1} + a_j B_{j-2} \qquad j = 1,2,\ldots,n
\end{aligned} \tag{5.2.5}$$

This method was invented by J. Wallis in 1655 (!), and is discussed in his Arithmetica
Infinitorum [4]. You can easily prove it by induction.

In practice, this algorithm has some unattractive features: The recurrence (5.2.5)
frequently generates very large or very small values for the partial numerators and
denominators $A_j$ and $B_j$. There is thus the danger of overflow or underflow of the
floating-point representation. However, the recurrence (5.2.5) is linear in the A’s and
B’s. At any point you can rescale the currently saved two levels of the recurrence,
e.g., divide $A_j$, $B_j$, $A_{j-1}$, and $B_{j-1}$ all by $B_j$. This incidentally makes $A_j = f_j$
and is convenient for testing whether you have gone far enough: See if $f_j$ and $f_{j-1}$
from the last iteration are as close as you would like them to be. (If $B_j$ happens to
be zero, which can happen, just skip the renormalization for this cycle. A fancier
level of optimization is to renormalize only when an overflow is imminent, saving
the unnecessary divides. All this complicates the program logic.)
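
As an illustration of the recurrence with rescaling at every step, here is a sketch that evaluates the tangent continued fraction, for which $b_0 = 0$, $a_1 = x$, $a_j = -x^2$ for $j \ge 2$, and $b_j = 2j - 1$. The function name tan_wallis and its eps and maxit parameters are assumptions made for the example.

```c
#include <math.h>

/* tan(x) from its continued fraction via the Wallis recurrence, dividing
   the two saved levels by B_j at each step so that A_j equals f_j. */
double tan_wallis(double x, double eps, int maxit)
{
    double Am = 1.0, Bm = 0.0;    /* A_{-1}, B_{-1} */
    double A = 0.0, B = 1.0;      /* A_0 = b_0 = 0, B_0 = 1 */
    double fold = 0.0;            /* previous convergent */
    int j;

    for (j = 1; j <= maxit; j++) {
        double aj = (j == 1) ? x : -x * x;
        double bj = 2.0 * j - 1.0;
        double An = bj * A + aj * Am;
        double Bn = bj * B + aj * Bm;

        Am = A; Bm = B; A = An; B = Bn;
        if (B != 0.0) {           /* skip the renormalization if B_j is zero */
            Am /= B; Bm /= B; A /= B; B = 1.0;
            if (j > 1 && fabs(A - fold) <= eps * fabs(A)) return A;
            fold = A;
        }
    }
    return A / B;                 /* not converged within maxit steps */
}
```

The first two convergents are $x$ and $3x/(3 - x^2)$, which is a quick way to check the coefficient assignments by hand.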

Two newer algorithms have been proposed for evaluating continued fractions.
Steed’s method does not use $A_j$ and $B_j$ explicitly, but only the ratio $D_j = B_{j-1}/B_j$.
One calculates $D_j$ and $\Delta f_j = f_j - f_{j-1}$ recursively using

$$D_j = 1/(b_j + a_j D_{j-1}) \tag{5.2.6}$$

$$\Delta f_j = (b_j D_j - 1)\,\Delta f_{j-1} \tag{5.2.7}$$

Steed’s method (see, e.g., [5]) avoids the need for rescaling of intermediate results.
However, for certain continued fractions you can occasionally run into a situation
where the denominator in (5.2.6) approaches zero, so that $D_j$ and $\Delta f_j$ are very
large. The next $\Delta f_{j+1}$ will typically cancel this large change, but with loss of
accuracy in the numerical running sum of the $f_j$'s. It is awkward to program around
this, so Steed’s method can be recommended only for cases where you know in
advance that no denominator can vanish. We will use it for a special purpose in
the routine bessik (§6.7).

The best general method for evaluating continued fractions seems to be the
modified Lentz’s method [6]. The need for rescaling intermediate results is avoided
by using both the ratios

$$C_j = A_j/A_{j-1}, \qquad D_j = B_{j-1}/B_j \tag{5.2.8}$$

and calculating $f_j$ by

$$f_j = f_{j-1} C_j D_j \tag{5.2.9}$$

From equation (5.2.5), one easily shows that the ratios satisfy the recurrence relations

$$D_j = 1/(b_j + a_j D_{j-1}), \qquad C_j = b_j + a_j/C_{j-1} \tag{5.2.10}$$

In this algorithm there is the danger that the denominator in the expression for $D_j$,
or the quantity $C_j$ itself, might approach zero. Either of these conditions invalidates
(5.2.10). However, Thompson and Barnett [5] show how to modify Lentz’s algorithm
to fix this: Just shift the offending term by a small amount, e.g., $10^{-30}$. If you
work through a cycle of the algorithm with this prescription, you will see that $f_{j+1}$
is accurately calculated.

In detail, the modified Lentz’s algorithm is this: 

• Set $f_0 = b_0$; if $b_0 = 0$ set $f_0 = tiny$.

• Set $C_0 = f_0$.

• Set $D_0 = 0$.

• For $j = 1,2,\ldots$

    Set $D_j = b_j + a_j D_{j-1}$.

    If $D_j = 0$, set $D_j = tiny$.

    Set $C_j = b_j + a_j/C_{j-1}$.

    If $C_j = 0$ set $C_j = tiny$.

    Set $D_j = 1/D_j$.

    Set $\Delta_j = C_j D_j$.

    Set $f_j = f_{j-1}\Delta_j$.

    If $|\Delta_j - 1| < eps$ then exit.

Here eps is your floating-point precision, say $10^{-7}$ or $10^{-15}$. The parameter tiny
should be less than typical values of $eps\,|b_j|$, say $10^{-30}$.
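
The steps above translate almost line for line into code. The sketch below applies them to the tangent continued fraction, with $b_0 = 0$, $a_1 = x$, $a_j = -x^2$ for $j \ge 2$, and $b_j = 2j - 1$; the function name tan_lentz and the maxit cap are assumptions made for the example.

```c
#include <math.h>

/* tan(x) from its continued fraction by the modified Lentz algorithm. */
double tan_lentz(double x, double eps, int maxit)
{
    const double tiny = 1.0e-30;
    double f = tiny;              /* f_0 = b_0 = 0, so start from tiny */
    double C = f, D = 0.0, delta;
    int j;

    for (j = 1; j <= maxit; j++) {
        double aj = (j == 1) ? x : -x * x;
        double bj = 2.0 * j - 1.0;

        D = bj + aj * D;
        if (D == 0.0) D = tiny;
        C = bj + aj / C;
        if (C == 0.0) C = tiny;
        D = 1.0 / D;
        delta = C * D;
        f *= delta;
        if (fabs(delta - 1.0) < eps) break;   /* convergents have settled */
    }
    return f;
}
```

Because $b_0 = 0$ here, the first pass exercises the tiny prescription directly: $C_1$ is of order $x/tiny$, and the product $f_0 C_1 D_1$ recovers the first convergent $x$.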

The above algorithm assumes that you can terminate the evaluation of the 
continued fraction when $|f_j - f_{j-1}|$ is sufficiently small. This is usually the case, 
but by no means guaranteed. Jones [7] gives a list of theorems that can be used to 
justify this termination criterion for various kinds of continued fractions. 

There is at present no rigorous analysis of error propagation in Lentz’s algorithm. 
However, empirical tests suggest that it is at least as good as other methods. 

Manipulating Continued Fractions 


Several important properties of continued fractions can be used to rewrite them 
in forms that can speed up numerical computation. An equivalence transformation 

$$a_n \to \lambda a_n, \qquad b_n \to \lambda b_n, \qquad a_{n+1} \to \lambda a_{n+1} \tag{5.2.11}$$

leaves the value of a continued fraction unchanged. By a suitable choice of the scale 
factor $\lambda$ you can often simplify the form of the $a$’s and the $b$’s. Of course, you 
can carry out successive equivalence transformations, possibly with different $\lambda$’s, on 
successive terms of the continued fraction. 

The even and odd parts of a continued fraction are continued fractions whose 
successive convergents are $f_{2n}$ and $f_{2n+1}$, respectively. Their main use is that they 
converge twice as fast as the original continued fraction, and so if their terms are not 
much more complicated than the terms in the original there can be a big savings in 
computation. The formula for the even part of (5.2.2) is 

$$f_{\rm even} = d_0 + \cfrac{c_1}{d_1 + \cfrac{c_2}{d_2 + \cdots}} \tag{5.2.12}$$

where in terms of intermediate variables 

$$\alpha_1 = \frac{a_1}{b_1}, \qquad \alpha_n = \frac{a_n}{b_n b_{n-1}}, \quad n \ge 2 \tag{5.2.13}$$

we have 

$$\begin{aligned} d_0 &= b_0, \qquad c_1 = \alpha_1, \qquad d_1 = 1 + \alpha_2 \\ c_n &= -\alpha_{2n-1}\alpha_{2n-2}, \qquad d_n = 1 + \alpha_{2n-1} + \alpha_{2n}, \quad n \ge 2 \end{aligned} \tag{5.2.14}$$

You can find the similar formula for the odd part in the review by Blanch [1]. Often 
a combination of the transformations (5.2.14) and (5.2.11) is used to get the best 
form for numerical work. 

We will make frequent use of continued fractions in the next chapter. 


CITED REFERENCES AND FURTHER READING: 

Abramowitz, M., and Stegun, I.A. 1964, Handbook of Mathematical Functions, Applied 
Mathematics Series, Volume 55 (Washington: National Bureau of Standards; reprinted 1968 by 
Dover Publications, New York), §3.10. 

Blanch, G. 1964, SIAM Review, vol. 6, pp. 383-421. [1] 

Acton, F.S. 1970, Numerical Methods That Work, 1990, corrected edition (Washington: 
Mathematical Association of America), Chapter 11. [2] 

Cuyt, A., and Wuytack, L. 1987, Nonlinear Methods in Numerical Analysis (Amsterdam: 
North-Holland), Chapter 1. 

Fike, C.T. 1968, Computer Evaluation of Mathematical Functions (Englewood Cliffs, NJ: 
Prentice-Hall), §§8.2, 10.4, and 10.5. [3] 

Wallis, J. 1695, in Opera Mathematica, vol. 1, p. 355, Oxoniae e Theatro Sheldoniano. Reprinted 
by Georg Olms Verlag, Hildesheim, New York (1972). [4] 

Thompson, I.J., and Barnett, A.R. 1986, Journal of Computational Physics, vol. 64, pp. 490-509. [5] 

Lentz, W.J. 1976, Applied Optics, vol. 15, pp. 668-671. [6] 

Jones, W.B. 1973, in Padé Approximants and Their Applications, P.R. Graves-Morris, ed. 
(London: Academic Press), p. 125. [7] 


5.3 Polynomials and Rational Functions 

A polynomial of degree $N$ is represented numerically as a stored array of 
coefficients, c[j] with j = 0,...,N. We will always take c[0] to be the constant 
term in the polynomial, c[N] the coefficient of $x^N$; but of course other conventions 
are possible. There are two kinds of manipulations that you can do with a polynomial: 
numerical manipulations (such as evaluation), where you are given the numerical 
value of its argument, or algebraic manipulations, where you want to transform 
the coefficient array in some way without choosing any particular argument. Let’s 
start with the numerical. 

We assume that you know enough never to evaluate a polynomial this way: 

    p=c[0]+c[1]*x+c[2]*x*x+c[3]*x*x*x+c[4]*x*x*x*x;

or (even worse!), 

    p=c[0]+c[1]*x+c[2]*pow(x,2.0)+c[3]*pow(x,3.0)+c[4]*pow(x,4.0);

Come the (computer) revolution, all persons found guilty of such criminal 
behavior will be summarily executed, and their programs won’t be! It is a matter 
of taste, however, whether to write 

    p=c[0]+x*(c[1]+x*(c[2]+x*(c[3]+x*c[4])));

or 

    p=(((c[4]*x+c[3])*x+c[2])*x+c[1])*x+c[0];

If the number of coefficients c[0..n] is large, one writes 

    p=c[n];
    for(j=n-1;j>=0;j--) p=p*x+c[j];

or 

    p=c[j=n];
    while (j>0) p=p*x+c[--j];

Another useful trick is for evaluating a polynomial $P(x)$ and its derivative 
$dP(x)/dx$ simultaneously: 

    p=c[n];
    dp=0.0;
    for(j=n-1;j>=0;j--) {dp=dp*x+p; p=p*x+c[j];}

or 

    p=c[j=n];
    dp=0.0;
    while (j>0) {dp=dp*x+p; p=p*x+c[--j];}

which yields the polynomial as p and its derivative as dp. 

The above trick, which is basically synthetic division [1,2], generalizes to the 
evaluation of the polynomial and nd of its derivatives simultaneously: 

    void ddpoly(float c[], int nc, float x, float pd[], int nd)
    /* Given the nc+1 coefficients of a polynomial of degree nc as an array c[0..nc] with c[0]
       being the constant term, and given a value x, and given a value nd>1, this routine returns the
       polynomial evaluated at x as pd[0] and nd derivatives as pd[1..nd]. */
    {
        int nnd,j,i;
        float cnst=1.0;

        pd[0]=c[nc];
        for (j=1;j<=nd;j++) pd[j]=0.0;
        for (i=nc-1;i>=0;i--) {
            nnd=(nd < (nc-i) ? nd : nc-i);
            for (j=nnd;j>=1;j--)
                pd[j]=pd[j]*x+pd[j-1];
            pd[0]=pd[0]*x+c[i];
        }
        for (i=2;i<=nd;i++) {    /* After the first derivative, factorial constants come in. */
            cnst *= i;
            pd[i] *= cnst;
        }
    }


As a curiosity, you might be interested to know that polynomials of degree 
$n > 3$ can be evaluated in fewer than $n$ multiplications, at least if you are willing 
to precompute some auxiliary coefficients and, in some cases, do an extra addition. 
For example, the polynomial 

$$P(x) = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + a_4 x^4 \tag{5.3.1}$$

where $a_4 > 0$, can be evaluated with 3 multiplications and 5 additions as follows: 

$$P(x) = [(Ax + B)^2 + Ax + C]\,[(Ax + B)^2 + D] + E \tag{5.3.2}$$

where $A$, $B$, $C$, $D$, and $E$ are to be precomputed by 

$$\begin{aligned} A &= (a_4)^{1/4} \\ B &= \frac{a_3 - A^3}{4A^3} \\ D &= 3B^2 + 8B^3 + \frac{a_1 A - 2 a_2 B}{A^2} \\ C &= \frac{a_2}{A^2} - 2B - 6B^2 - D \\ E &= a_0 - B^4 - B^2(C + D) - CD \end{aligned} \tag{5.3.3}$$

Fifth degree polynomials can be evaluated in 4 multiplies and 5 adds; sixth degree 
polynomials can be evaluated in 4 multiplies and 7 adds; if any of this strikes 
you as interesting, consult references [3-5]. The subject has something of the same 
entertaining, if impractical, flavor as that of fast matrix multiplication, discussed 
in §2.11. 


Turn now to algebraic manipulations. You multiply a polynomial of degree 
$n - 1$ (array of range [0..n-1]) by a monomial factor $x - a$ by a bit of code 
like the following, 

    c[n]=c[n-1];
    for (j=n-1;j>=1;j--) c[j]=c[j-1]-c[j]*a;
    c[0] *= (-a);

Likewise, you divide a polynomial of degree $n$ by a monomial factor $x - a$ 
(synthetic division again) using 

    rem=c[n];
    c[n]=0.0;
    for(i=n-1;i>=0;i--) {
        swap=c[i];
        c[i]=rem;
        rem=swap+rem*a;
    }


which leaves you with a new polynomial array and a numerical remainder rem. 
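
Wrapped as functions, the two fragments make a convenient round trip: multiply by $x - a$, divide by it again, and you should recover the original coefficients with zero remainder. The function names below are hypothetical, chosen for this sketch; the loops themselves are the ones shown above, in double precision.

```c
#include <assert.h>

/* Multiply the polynomial c[0..n-1] (degree n-1) by (x - a) in place.
   The array must have room for the new top coefficient c[n]. */
void poly_mul_monomial(double c[], int n, double a)
{
    assert(n >= 1);
    c[n] = c[n - 1];
    for (int j = n - 1; j >= 1; j--) c[j] = c[j - 1] - c[j] * a;
    c[0] *= -a;
}

/* Divide the polynomial c[0..n] (degree n) by (x - a) in place.
   The quotient is left in c[0..n-1], c[n] is zeroed, and the
   numerical remainder is returned. */
double poly_div_monomial(double c[], int n, double a)
{
    assert(n >= 0);
    double rem = c[n];
    c[n] = 0.0;
    for (int i = n - 1; i >= 0; i--) {
        double swap = c[i];
        c[i] = rem;
        rem = swap + rem * a;
    }
    return rem;
}
```

For example, starting from $1 + x$ and $a = 2$, the multiplication gives $-2 - x + x^2$, and the division returns $1 + x$ with remainder zero.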

Multiplication of two general polynomials involves straightforward summing 
of the products, each involving one coefficient from each polynomial. Division of 
two general polynomials, while it can be done awkwardly in the fashion taught using 
pencil and paper, is susceptible to a good deal of streamlining. Witness the following 
routine based on the algorithm in [3]. 

    void poldiv(float u[], int n, float v[], int nv, float q[], float r[])
    /* Given the n+1 coefficients of a polynomial of degree n in u[0..n], and the nv+1 coefficients
       of another polynomial of degree nv in v[0..nv], divide the polynomial u by the polynomial
       v ("u"/"v") giving a quotient polynomial whose coefficients are returned in q[0..n], and a
       remainder polynomial whose coefficients are returned in r[0..n]. The elements r[nv..n]
       and q[n-nv+1..n] are returned as zero. */
    {
        int k,j;

        for (j=0;j<=n;j++) {
            r[j]=u[j];
            q[j]=0.0;
        }
        for (k=n-nv;k>=0;k--) {
            q[k]=r[nv+k]/v[nv];
            for (j=nv+k-1;j>=k;j--) r[j] -= q[k]*v[j-k];
        }
        for (j=nv;j<=n;j++) r[j]=0.0;
    }


Rational Functions 

You evaluate a rational function like 

$$\frac{P_\mu(x)}{Q_\nu(x)} = \frac{p_0 + p_1 x + \cdots + p_\mu x^\mu}{q_0 + q_1 x + \cdots + q_\nu x^\nu}$$

in the obvious way, namely as two separate polynomials followed by a divide. As 
a matter of convention one usually chooses $q_0 = 1$, obtained by dividing numerator 
and denominator by any other $q_0$. It is often convenient to have both sets of 
coefficients stored in a single array, and to have a standard function available for 
doing the evaluation: 

    double ratval(double x, double cof[], int mm, int kk)
    /* Given mm, kk, and cof[0..mm+kk], evaluate and return the rational function (cof[0] +
       cof[1]x + ... + cof[mm]x^mm)/(1 + cof[mm+1]x + ... + cof[mm+kk]x^kk). */
    {
        int j;
        double sumd,sumn;    /* Note precision! Change to float if desired. */

        for (sumn=cof[mm],j=mm-1;j>=0;j--) sumn=sumn*x+cof[j];
        for (sumd=0.0,j=mm+kk;j>=mm+1;j--) sumd=(sumd+cof[j])*x;
        return sumn/(1.0+sumd);
    }


CITED REFERENCES AND FURTHER READING: 

Acton, F.S. 1970, Numerical Methods That Work, 1990, corrected edition (Washington: 
Mathematical Association of America), pp. 183, 190. [1] 

Mathews, J., and Walker, R.L. 1970, Mathematical Methods of Physics, 2nd ed. (Reading, MA: 
W.A. Benjamin/Addison-Wesley), pp. 361-363. [2] 

Knuth, D.E. 1981, Seminumerical Algorithms, 2nd ed., vol. 2 of The Art of Computer Programming 
(Reading, MA: Addison-Wesley), §4.6. [3] 

Fike, C.T. 1968, Computer Evaluation of Mathematical Functions (Englewood Cliffs, NJ: 
Prentice-Hall), Chapter 4. 

Winograd, S. 1970, Communications on Pure and Applied Mathematics, vol. 23, pp. 165-179. [4] 

Kronsjo, L. 1987, Algorithms: Their Complexity and Efficiency, 2nd ed. (New York: Wiley). [5] 



5.4 Complex Arithmetic 

As we mentioned in §1.2, the lack of built-in complex arithmetic in C is a 
nuisance for numerical work. Even in languages like FORTRAN that have complex 
data types, it is disconcertingly common to encounter complex operations that 
produce overflows or underflows when both the complex operands and the complex 
result are perfectly representable. This occurs, we think, because software companies 
assign inexperienced programmers to what they believe to be the perfectly trivial 
task of implementing complex arithmetic. 

Actually, complex arithmetic is not quite trivial. Addition and subtraction 
are done in the obvious way, performing the operation separately on the real and 
imaginary parts of the operands. Multiplication can also be done in the obvious way, 
with 4 multiplications, one addition, and one subtraction, 

$$(a + ib)(c + id) = (ac - bd) + i(bc + ad) \tag{5.4.1}$$

(the addition before the $i$ doesn’t count; it just separates the real and imaginary parts 
notationally). But it is sometimes faster to multiply via 

$$(a + ib)(c + id) = (ac - bd) + i[(a + b)(c + d) - ac - bd] \tag{5.4.2}$$

which has only three multiplications ($ac$, $bd$, $(a + b)(c + d)$), plus two additions and 
three subtractions. The total operations count is higher by two, but multiplication 
is a slow operation on some machines. 

While it is true that intermediate results in equations (5.4.1) and (5.4.2) can 
overflow even when the final result is representable, this happens only when the final 
answer is on the edge of representability. Not so for the complex modulus, if you 
are misguided enough to compute it as 

$$|a + ib| = \sqrt{a^2 + b^2} \qquad \text{(bad!)} \tag{5.4.3}$$


whose intermediate result will overflow if either $a$ or $b$ is as large as the square 
root of the largest representable number (e.g., $10^{19}$ as compared to $10^{38}$). The right 
way to do the calculation is 

$$|a + ib| = \begin{cases} |a|\sqrt{1 + (b/a)^2} & |a| \ge |b| \\ |b|\sqrt{1 + (a/b)^2} & |a| < |b| \end{cases} \tag{5.4.4}$$

Complex division should use a similar trick to prevent avoidable overflows, 
underflow, or loss of precision, 

$$\frac{a + ib}{c + id} = \begin{cases} \dfrac{[a + b(d/c)] + i[b - a(d/c)]}{c + d(d/c)} & |c| \ge |d| \\[2ex] \dfrac{[a(c/d) + b] + i[b(c/d) - a]}{c(c/d) + d} & |c| < |d| \end{cases} \tag{5.4.5}$$

Of course you should calculate repeated subexpressions, like c/d or d/c, only once. 

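Complex square root is even more complicated, since we must both guard 
intermediate results, and also enforce a chosen branch cut (here taken to be the 
negative real axis). To take the square root of $c + id$, first compute 

$$w \equiv \begin{cases} 0 & c = d = 0 \\[1ex] \sqrt{|c|}\,\sqrt{\dfrac{1 + \sqrt{1 + (d/c)^2}}{2}} & |c| \ge |d| \\[2ex] \sqrt{|d|}\,\sqrt{\dfrac{|c/d| + \sqrt{1 + (c/d)^2}}{2}} & |c| < |d| \end{cases} \tag{5.4.6}$$

Then the answer is 

$$\sqrt{c + id} = \begin{cases} 0 & w = 0 \\[1ex] w + i\left(\dfrac{d}{2w}\right) & w \ne 0,\ c \ge 0 \\[2ex] \dfrac{|d|}{2w} + iw & w \ne 0,\ c < 0,\ d \ge 0 \\[2ex] \dfrac{|d|}{2w} - iw & w \ne 0,\ c < 0,\ d < 0 \end{cases} \tag{5.4.7}$$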

Routines implementing these algorithms are listed in Appendix C. 


CITED REFERENCES AND FURTHER READING: 

Midy, P., and Yakovlev, Y. 1991, Mathematics and Computers in Simulation, vol. 33, pp. 33-49. 
Knuth, D.E. 1981, Seminumerical Algorithms, 2nd ed., vol. 2 of The Art of Computer Programming 
(Reading, MA: Addison-Wesley) [see solutions to exercises 4.2.1.16 and 4.6.4.41]. 


5.5 Recurrence Relations and Clenshaw’s 
Recurrence Formula 

Many useful functions satisfy recurrence relations, e.g., 

$$(n + 1)P_{n+1}(x) = (2n + 1)xP_n(x) - nP_{n-1}(x) \tag{5.5.1}$$

$$J_{n+1}(x) = \frac{2n}{x}J_n(x) - J_{n-1}(x) \tag{5.5.2}$$

$$nE_{n+1}(x) = e^{-x} - xE_n(x) \tag{5.5.3}$$

$$\cos n\theta = 2\cos\theta\,\cos(n - 1)\theta - \cos(n - 2)\theta \tag{5.5.4}$$

$$\sin n\theta = 2\cos\theta\,\sin(n - 1)\theta - \sin(n - 2)\theta \tag{5.5.5}$$

where the first three functions are Legendre polynomials, Bessel functions of the first 
kind, and exponential integrals, respectively. (For notation see [1].) These relations 
are useful for extending computational methods from two successive values of $n$ to 
other values, either larger or smaller. 

Equations (5.5.4) and (5.5.5) motivate us to say a few words about trigonometric 
functions. If your program’s running time is dominated by evaluating trigonometric 
functions, you are probably doing something wrong. Trig functions whose arguments 
form a linear sequence $\theta = \theta_0 + n\delta$, $n = 0, 1, 2, \ldots$, are efficiently calculated by 
the following recurrence, 

$$\begin{aligned} \cos(\theta + \delta) &= \cos\theta - [\alpha\cos\theta + \beta\sin\theta] \\ \sin(\theta + \delta) &= \sin\theta - [\alpha\sin\theta - \beta\cos\theta] \end{aligned} \tag{5.5.6}$$

where $\alpha$ and $\beta$ are the precomputed coefficients 

$$\alpha \equiv 2\sin^2\left(\frac{\delta}{2}\right), \qquad \beta \equiv \sin\delta \tag{5.5.7}$$

The reason for doing things this way, rather than with the standard (and equivalent) 
identities for sums of angles, is that here $\alpha$ and $\beta$ do not lose significance if the 
incremental $\delta$ is small. Likewise, the adds in equation (5.5.6) should be done in 
the order indicated by square brackets. We will use (5.5.6) repeatedly in Chapter 
12, when we deal with Fourier transforms. 

Another trick, occasionally useful, is to note that both $\sin\theta$ and $\cos\theta$ can be 
calculated via a single call to tan: 

$$t \equiv \tan\left(\frac{\theta}{2}\right), \qquad \cos\theta = \frac{1 - t^2}{1 + t^2}, \qquad \sin\theta = \frac{2t}{1 + t^2} \tag{5.5.8}$$

The cost of getting both sin and cos, if you need them, is thus the cost of tan plus 
2 multiplies, 2 divides, and 2 adds. On machines with slow trig functions, this can 
be a savings. However, note that special treatment is required if $\theta \to \pm\pi$. And also 
note that many modern machines have very fast trig functions; so you should not 
assume that equation (5.5.8) is faster without testing. 

Stability of Recurrences 

You need to be aware that recurrence relations are not necessarily stable 
against roundoff error in the direction that you propose to go (either increasing $n$ or 
decreasing $n$). A three-term linear recurrence relation 

$$y_{n+1} + a_n y_n + b_n y_{n-1} = 0, \qquad n = 1, 2, \ldots \tag{5.5.9}$$

has two linearly independent solutions, $f_n$ and $g_n$ say. Only one of these corresponds 
to the sequence of functions $f_n$ that you are trying to generate. The other one, $g_n$, 
may be exponentially growing in the direction that you want to go, or exponentially 
damped, or exponentially neutral (growing or dying as some power law, for example). 
If it is exponentially growing, then the recurrence relation is of little or no practical 
use in that direction. This is the case, e.g., for (5.5.2) in the direction of increasing 
$n$, when $x < n$. You cannot generate Bessel functions of high $n$ by forward 
recurrence on (5.5.2). 

To state things a bit more formally, if 

$$f_n/g_n \to 0 \quad \text{as} \quad n \to \infty \tag{5.5.10}$$

then $f_n$ is called the minimal solution of the recurrence relation (5.5.9). Nonminimal 
solutions like $g_n$ are called dominant solutions. The minimal solution is unique, if it 
exists, but dominant solutions are not: you can add an arbitrary multiple of $f_n$ to 
a given $g_n$. You can evaluate any dominant solution by forward recurrence, but not 
the minimal solution. (Unfortunately it is sometimes the one you want.) 

Abramowitz and Stegun (in their Introduction) [1] give a list of recurrences that 
are stable in the increasing or decreasing directions. That list does not contain all 
possible formulas, of course. Given a recurrence relation for some function $f_n(x)$ 
you can test it yourself with about five minutes of (human) labor: For a fixed $x$ 
in your range of interest, start the recurrence not with true values of $f_j(x)$ and 
$f_{j+1}(x)$, but (first) with the values 1 and 0, respectively, and then (second) with 
0 and 1, respectively. Generate 10 or 20 terms of the recursive sequences in the 
direction that you want to go (increasing or decreasing from $j$), for each of the two 
starting conditions. Look at the difference between the corresponding members of 
the two sequences. If the differences stay of order unity (absolute value less than 
10, say), then the recurrence is stable. If they increase slowly, then the recurrence 
may be mildly unstable but quite tolerably so. If they increase catastrophically, then 
there is an exponentially growing solution of the recurrence. If you know that the 
function that you want actually corresponds to the growing solution, then you can 
keep the recurrence formula anyway (e.g., the case of the Bessel function $Y_n(x)$ for 
increasing $n$, see §6.5); if you don’t know which solution your function corresponds 
to, you must at this point reject the recurrence formula. Notice that you can do this 
test before you go to the trouble of finding a numerical method for computing the 
two starting functions $f_j(x)$ and $f_{j+1}(x)$: stability is a property of the recurrence, 
not of the starting values. 
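
The five minutes of human labor can be delegated to the machine. The sketch below applies the test to the Bessel recurrence (5.5.2) in the upward direction; the function name and its interface are hypothetical, introduced only for this example. It runs the two trial sequences side by side and reports the largest difference seen.

```c
#include <assert.h>
#include <math.h>

/* Run y_{n+1} = (2n/x) y_n - y_{n-1} upward from (y_j, y_{j+1}), once
   started with (1, 0) and once with (0, 1), for nsteps steps. Returns the
   largest absolute difference between corresponding members. */
double bessel_upward_growth(double x, int j, int nsteps)
{
    assert(x > 0.0 && j >= 1 && nsteps >= 1);
    double u0 = 1.0, u1 = 0.0;    /* first trial:  y_j = 1, y_{j+1} = 0 */
    double v0 = 0.0, v1 = 1.0;    /* second trial: y_j = 0, y_{j+1} = 1 */
    double worst = 0.0;
    for (int n = j + 1; n < j + 1 + nsteps; n++) {
        double u2 = (2.0 * n / x) * u1 - u0;   /* y_{n+1} from y_n, y_{n-1} */
        double v2 = (2.0 * n / x) * v1 - v0;
        double d = fabs(u2 - v2);
        if (d > worst) worst = d;
        u0 = u1; u1 = u2;
        v0 = v1; v1 = v2;
    }
    return worst;
}
```

For $x = 1$ the differences explode within 20 steps, which condemns upward recurrence there; for $x = 50$, where all the $n$ visited satisfy $n < x$, they stay of order unity.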

An alternative heuristic procedure for testing stability is to replace the 
recurrence relation by a similar one that is linear with constant coefficients. For example, 
the relation (5.5.2) becomes 

$$y_{n+1} - 2\gamma y_n + y_{n-1} = 0 \tag{5.5.11}$$

where $\gamma \equiv n/x$ is treated as a constant. You solve such recurrence relations 
by trying solutions of the form $y_n = a^n$. Substituting into the above 
recurrence gives 

$$a^2 - 2\gamma a + 1 = 0 \qquad \text{or} \qquad a = \gamma \pm \sqrt{\gamma^2 - 1} \tag{5.5.12}$$

The recurrence is stable if $|a| \le 1$ for all solutions $a$. This holds (as you can verify) 
if $|\gamma| \le 1$ or $n \le x$. The recurrence (5.5.2) thus cannot be used, starting with $J_0(x)$ 
and $J_1(x)$, to compute $J_n(x)$ for large $n$. 

Possibly you would at this point like the security of some real theorems on 
this subject (although we ourselves always follow one of the heuristic procedures). 
Here are two theorems, due to Perron [2]: 

Theorem A. If in (5.5.9) $a_n \sim an^\alpha$, $b_n \sim bn^\beta$ as $n \to \infty$, and $\beta < 2\alpha$, then 

$$g_{n+1}/g_n \sim -an^\alpha, \qquad f_{n+1}/f_n \sim -(b/a)n^{\beta-\alpha} \tag{5.5.13}$$

and $f_n$ is the minimal solution to (5.5.9). 

Theorem B. Under the same conditions as Theorem A, but with $\beta = 2\alpha$, 
consider the characteristic polynomial 

$$t^2 + at + b = 0 \tag{5.5.14}$$

If the roots $t_1$ and $t_2$ of (5.5.14) have distinct moduli, $|t_1| > |t_2|$ say, then 

$$g_{n+1}/g_n \sim t_1 n^\alpha, \qquad f_{n+1}/f_n \sim t_2 n^\alpha \tag{5.5.15}$$

and $f_n$ is again the minimal solution to (5.5.9). Cases other than those in these 
two theorems are inconclusive for the existence of minimal solutions. (For more 
on the stability of recurrences, see [3].) 

How do you proceed if the solution that you desire is the minimal solution? The 
answer lies in that old aphorism, that every cloud has a silver lining: If a recurrence 
relation is catastrophically unstable in one direction, then that (undesired) solution 
will decrease very rapidly in the reverse direction. This means that you can start 
with any seed values for the consecutive $f_j$ and $f_{j+1}$ and (when you have gone 
enough steps in the stable direction) you will converge to the sequence of functions 
that you want, times an unknown normalization factor. If there is some other way 
to normalize the sequence (e.g., by a formula for the sum of the $f_n$’s), then this 
can be a practical means of function evaluation. The method is called Miller’s 
algorithm. An example often given [1,4] uses equation (5.5.2) in just this way, along 
with the normalization formula 

$$1 = J_0(x) + 2J_2(x) + 2J_4(x) + 2J_6(x) + \cdots \tag{5.5.16}$$
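
Here is roughly what Miller's algorithm looks like for this example. Treat it as a sketch: the function name, the caller-supplied starting order start, and the seed value are all assumptions of ours, and no rescaling is done, so a very large start with small $x$ could overflow.

```c
#include <assert.h>

/* J_0(x)..J_nmax(x) by downward recurrence on (5.5.2), normalized
   with the sum (5.5.16). Results are returned in out[0..nmax].
   The starting order must satisfy start > nmax and be well above x. */
void bessel_j_miller(double x, int nmax, int start, double out[])
{
    assert(x > 0.0 && nmax >= 0 && start > nmax);
    double fnp1 = 0.0;        /* seed for J_{start+1} */
    double fn   = 1.0e-30;    /* arbitrary nonzero seed for J_start */
    double sum  = 0.0;
    for (int n = start; n >= 1; n--) {
        /* Here fn holds the unnormalized J_n and fnp1 holds J_{n+1}. */
        if (n <= nmax) out[n] = fn;
        if (n % 2 == 0) sum += 2.0 * fn;
        double fnm1 = (2.0 * n / x) * fn - fnp1;   /* J_{n-1} */
        fnp1 = fn;
        fn = fnm1;
    }
    out[0] = fn;
    sum += fn;                                     /* J_0 enters with weight 1 */
    for (int n = 0; n <= nmax; n++) out[n] /= sum;
}
```

With $x = 1$ and a starting order of 40, this gives $J_0(1) \approx 0.7651976866$ and $J_1(1) \approx 0.4400505857$.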


Incidentally, there is an important relation between three-term recurrence 
relations and continued fractions. Rewrite the recurrence relation (5.5.9) as 

$$\frac{y_n}{y_{n-1}} = -\frac{b_n}{a_n + y_{n+1}/y_n} \tag{5.5.17}$$

Iterating this equation, starting with $n$, gives 

$$\frac{y_n}{y_{n-1}} = -\cfrac{b_n}{a_n - \cfrac{b_{n+1}}{a_{n+1} - \cdots}} \tag{5.5.18}$$

Pincherle’s Theorem [2] tells us that (5.5.18) converges if and only if (5.5.9) has a 
minimal solution $f_n$, in which case it converges to $f_n/f_{n-1}$. This result, usually for 
the case $n = 1$ and combined with some way to determine $f_0$, underlies many of the 
practical methods for computing special functions that we give in the next chapter. 


Clenshaw’s Recurrence Formula 


Clenshaw’s recurrence formula [5] is an elegant and efficient way to evaluate a 
sum of coefficients times functions that obey a recurrence formula, e.g., 

$$f(\theta) = \sum_{k=0}^{N} c_k \cos k\theta \qquad \text{or} \qquad f(x) = \sum_{k=0}^{N} c_k P_k(x)$$

Here is how it works: Suppose that the desired sum is 

$$f(x) = \sum_{k=0}^{N} c_k F_k(x) \tag{5.5.19}$$

and that $F_k$ obeys the recurrence relation 

$$F_{n+1}(x) = \alpha(n,x)F_n(x) + \beta(n,x)F_{n-1}(x) \tag{5.5.20}$$

for some functions $\alpha(n,x)$ and $\beta(n,x)$. Now define the quantities $y_k$ ($k = 
N, N-1, \ldots, 1$) by the following recurrence: 

$$\begin{aligned} y_{N+2} &= y_{N+1} = 0 \\ y_k &= \alpha(k,x)y_{k+1} + \beta(k+1,x)y_{k+2} + c_k \qquad (k = N, N-1, \ldots, 1) \end{aligned} \tag{5.5.21}$$

If you solve equation (5.5.21) for $c_k$ on the left, and then write out explicitly the 
sum (5.5.19), it will look (in part) like this: 

$$\begin{aligned} f(x) = \cdots &+ [y_8 - \alpha(8,x)y_9 - \beta(9,x)y_{10}]F_8(x) \\ &+ [y_7 - \alpha(7,x)y_8 - \beta(8,x)y_9]F_7(x) \\ &+ [y_6 - \alpha(6,x)y_7 - \beta(7,x)y_8]F_6(x) \\ &+ [y_5 - \alpha(5,x)y_6 - \beta(6,x)y_7]F_5(x) \\ &+ \cdots \\ &+ [y_2 - \alpha(2,x)y_3 - \beta(3,x)y_4]F_2(x) \\ &+ [y_1 - \alpha(1,x)y_2 - \beta(2,x)y_3]F_1(x) \\ &+ [c_0 + \beta(1,x)y_2 - \beta(1,x)y_2]F_0(x) \end{aligned} \tag{5.5.22}$$

Notice that we have added and subtracted $\beta(1,x)y_2$ in the last line. If you examine 
the terms containing a factor of $y_8$ in (5.5.22), you will find that they sum to zero as 
a consequence of the recurrence relation (5.5.20); similarly all the other $y_k$’s down 
through $y_2$. The only surviving terms in (5.5.22) are 

$$f(x) = \beta(1,x)F_0(x)y_2 + F_1(x)y_1 + F_0(x)c_0 \tag{5.5.23}$$

Equations (5.5.21) and (5.5.23) are Clenshaw’s recurrence formula for doing the 
sum (5.5.19): You make one pass down through the $y_k$’s using (5.5.21); when you 
have reached $y_2$ and $y_1$ you apply (5.5.23) to get the desired answer. 
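
To make this concrete, take a Legendre series. From (5.5.1), $\alpha(n,x) = (2n+1)x/(n+1)$ and $\beta(n,x) = -n/(n+1)$, with $F_0 = 1$ and $F_1 = x$. The following sketch (the function name legendre_series is hypothetical, chosen for the example) makes the single downward pass of (5.5.21) and then applies (5.5.23).

```c
#include <assert.h>

/* Sum_{k=0}^{N} c[k] P_k(x) by Clenshaw's recurrence. */
double legendre_series(const double c[], int N, double x)
{
    assert(N >= 0);
    double y1 = 0.0, y2 = 0.0;    /* hold y_{k+1} and y_{k+2} */
    for (int k = N; k >= 1; k--) {
        double alpha = (2.0 * k + 1.0) * x / (k + 1.0);   /* alpha(k, x)   */
        double beta  = -(k + 1.0) / (k + 2.0);            /* beta(k+1, x)  */
        double yk = alpha * y1 + beta * y2 + c[k];
        y2 = y1;
        y1 = yk;
    }
    /* (5.5.23) with F_0 = 1, F_1 = x, beta(1, x) = -1/2 */
    return -0.5 * y2 + x * y1 + c[0];
}
```

For the coefficients $\{1, 2, 3\}$ at $x = 0.5$ this returns $1 + 2(0.5) + 3P_2(0.5) = 1.625$, without ever forming $P_2$ explicitly.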

Clenshaw’s recurrence as written above incorporates the coefficients $c_k$ in a 
downward order, with $k$ decreasing. At each stage, the effect of all previous $c_k$’s 
is “remembered” as two coefficients which multiply the functions $F_{k+1}$ and $F_k$ 
(ultimately $F_0$ and $F_1$). If the functions $F_k$ are small when $k$ is large, and if the 
coefficients $c_k$ are small when $k$ is small, then the sum can be dominated by small 
$F_k$’s. In this case the remembered coefficients will involve a delicate cancellation 
and there can be a catastrophic loss of significance. An example would be to sum 
the trivial series 

$$J_{15}(1) = 0 \times J_0(1) + 0 \times J_1(1) + \cdots + 0 \times J_{14}(1) + 1 \times J_{15}(1) \tag{5.5.24}$$

Here $J_{15}$, which is tiny, ends up represented as a canceling linear combination of 
$J_0$ and $J_1$, which are of order unity. 

The solution in such cases is to use an alternative Clenshaw recurrence that 
incorporates $c_k$’s in an upward direction. The relevant equations are 

$$y_{-2} = y_{-1} = 0 \tag{5.5.25}$$

$$y_k = \frac{1}{\beta(k+1,x)}\,[y_{k-2} - \alpha(k,x)y_{k-1} - c_k], \qquad (k = 0, 1, \ldots, N-1) \tag{5.5.26}$$

$$f(x) = c_N F_N(x) - \beta(N,x)F_{N-1}(x)y_{N-1} - F_N(x)y_{N-2} \tag{5.5.27}$$

The rare case where equations (5.5.25)-(5.5.27) should be used instead of 
equations (5.5.21) and (5.5.23) can be detected automatically by testing whether 
the operands in the first sum in (5.5.23) are opposite in sign and nearly equal in 
magnitude. Other than in this special case, Clenshaw’s recurrence is always stable, 
independent of whether the recurrence for the functions $F_k$ is stable in the upward 
or downward direction. 


CITED REFERENCES AND FURTHER READING: 

Abramowitz, M., and Stegun, I.A. 1964, Handbook of Mathematical Functions, Applied 
Mathematics Series, Volume 55 (Washington: National Bureau of Standards; reprinted 1968 by 
Dover Publications, New York), pp. xiii, 697. [1] 

Gautschi, W. 1967, SIAM Review, vol. 9, pp. 24-82. [2] 

Lakshmikantham, V., and Trigiante, D. 1988, Theory of Difference Equations: Numerical Methods 
and Applications (San Diego: Academic Press). [3] 

Acton, F.S. 1970, Numerical Methods That Work, 1990, corrected edition (Washington: 
Mathematical Association of America), pp. 20ff. [4] 

Clenshaw, C.W. 1962, Mathematical Tables, vol. 5, National Physical Laboratory (London: H.M. 
Stationery Office). [5] 

Dahlquist, G., and Bjorck, A. 1974, Numerical Methods (Englewood Cliffs, NJ: Prentice-Hall), 
§4.4.3, p. 111. 

Goodwin, E.T. (ed.) 1961, Modern Computing Methods, 2nd ed. (New York: Philosophical 
Library), p. 76. 


5.6 Quadratic and Cubic Equations 

The roots of simple algebraic equations can be viewed as being functions of the 
equations’ coefficients. We are taught these functions in elementary algebra. Yet, 
surprisingly many people don’t know the right way to solve a quadratic equation 
with two real roots, or to obtain the roots of a cubic equation. 

There are two ways to write the solution of the quadratic equation 

$$ax^2 + bx + c = 0 \tag{5.6.1}$$

with real coefficients $a$, $b$, $c$, namely 

$$x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a} \tag{5.6.2}$$

and 

$$x = \frac{2c}{-b \pm \sqrt{b^2 - 4ac}} \tag{5.6.3}$$

If you use either (5.6.2) or (5.6.3) to get the two roots, you are asking for trouble: 
If either $a$ or $c$ (or both) are small, then one of the roots will involve the subtraction 
of $b$ from a very nearly equal quantity (the discriminant); you will get that root very 
inaccurately. The correct way to compute the roots is 

$$q \equiv -\frac{1}{2}\left[b + \mathrm{sgn}(b)\sqrt{b^2 - 4ac}\right] \tag{5.6.4}$$

Then the two roots are 

$$x_1 = \frac{q}{a} \qquad \text{and} \qquad x_2 = \frac{c}{q} \tag{5.6.5}$$

If the coefficients $a$, $b$, $c$ are complex rather than real, then the above formulas 
still hold, except that in equation (5.6.4) the sign of the square root should be 
chosen so as to make 

$$\mathrm{Re}\left(b^* \sqrt{b^2 - 4ac}\right) \ge 0 \tag{5.6.6}$$

where Re denotes the real part and asterisk denotes complex conjugation. 

Apropos of quadratic equations, this seems a convenient place to recall that 
the inverse hyperbolic functions $\sinh^{-1}$ and $\cosh^{-1}$ are in fact just logarithms of 
solutions to such equations, 

$$\sinh^{-1}(x) = \ln\left(x + \sqrt{x^2 + 1}\right) \tag{5.6.7}$$

$$\cosh^{-1}(x) = \pm\ln\left(x + \sqrt{x^2 - 1}\right) \tag{5.6.8}$$

Equation (5.6.7) is numerically robust for $x \ge 0$. For negative $x$, use the symmetry 
$\sinh^{-1}(-x) = -\sinh^{-1}(x)$. Equation (5.6.8) is of course valid only for $x \ge 1$. 

For the cubic equation 

$$x^3 + ax^2 + bx + c = 0 \tag{5.6.9}$$

with real or complex coefficients $a$, $b$, $c$, first compute 

$$Q \equiv \frac{a^2 - 3b}{9} \qquad \text{and} \qquad R \equiv \frac{2a^3 - 9ab + 27c}{54} \tag{5.6.10}$$

If $Q$ and $R$ are real (always true when $a$, $b$, $c$ are real) and $R^2 < Q^3$, then the cubic 
equation has three real roots. Find them by computing 

$$\theta = \arccos\left(R/\sqrt{Q^3}\right) \tag{5.6.11}$$

in terms of which the three roots are 

$$\begin{aligned} x_1 &= -2\sqrt{Q}\cos\left(\frac{\theta}{3}\right) - \frac{a}{3} \\ x_2 &= -2\sqrt{Q}\cos\left(\frac{\theta + 2\pi}{3}\right) - \frac{a}{3} \\ x_3 &= -2\sqrt{Q}\cos\left(\frac{\theta - 2\pi}{3}\right) - \frac{a}{3} \end{aligned} \tag{5.6.12}$$

(This equation first appears in Chapter VI of François Viète’s treatise “De 
emendatione,” published in 1615!) 

Otherwise, compute 

$$A = -\left[R + \sqrt{R^2 - Q^3}\right]^{1/3} \tag{5.6.13}$$

where the sign of the square root is chosen to make 

$$\mathrm{Re}\left(R^* \sqrt{R^2 - Q^3}\right) \ge 0 \tag{5.6.14}$$

(asterisk again denoting complex conjugation). If $Q$ and $R$ are both real, equations 
(5.6.13)-(5.6.14) are equivalent to 

$$A = -\mathrm{sgn}(R)\left[|R| + \sqrt{R^2 - Q^3}\right]^{1/3} \tag{5.6.15}$$

where the positive square root is assumed. Next compute 

$$B = \begin{cases} Q/A & (A \ne 0) \\ 0 & (A = 0) \end{cases} \tag{5.6.16}$$

in terms of which the three roots are 

$$x_1 = (A + B) - \frac{a}{3} \tag{5.6.17}$$

(the single real root when $a$, $b$, $c$ are real) and 

$$\begin{aligned} x_2 &= -\frac{1}{2}(A + B) - \frac{a}{3} + i\frac{\sqrt{3}}{2}(A - B) \\ x_3 &= -\frac{1}{2}(A + B) - \frac{a}{3} - i\frac{\sqrt{3}}{2}(A - B) \end{aligned} \tag{5.6.18}$$

(in that same case, a complex conjugate pair). Equations (5.6.13)-(5.6.16) are 
arranged both to minimize roundoff error, and also (as pointed out by A.J. Glassman) 
to ensure that no choice of branch for the complex cube root can result in the 
spurious loss of a distinct root. 

If you need to solve many cubic equations with only slightly different 
coefficients, it is more efficient to use Newton’s method (§9.4). 


CITED REFERENCES AND FURTHER READING: 

Weast, R.C. (ed.) 1967, Handbook of Tables for Mathematics, 3rd ed. (Cleveland: The Chemical 
Rubber Co.), pp. 130-133. 

Pachner, J. 1983, Handbook of Numerical Analysis Applications (New York: McGraw-Hill), §6.1. 

McKelvey, J.P. 1984, American Journal of Physics, vol. 52, pp. 269-270; see also vol. 53, p. 775, 
and vol. 55, pp. 374-375. 


5.7 Numerical Derivatives 


Imagine that you have a procedure which computes a function $f(x)$, and now 
you want to compute its derivative $f'(x)$. Easy, right? The definition of the 
derivative, the limit as $h \to 0$ of 

$$f'(x) \approx \frac{f(x + h) - f(x)}{h} \tag{5.7.1}$$

practically suggests the program: Pick a small value $h$; evaluate $f(x + h)$; you 
probably have $f(x)$ already evaluated, but if not, do it too; finally apply equation 
(5.7.1). What more needs to be said? 

Quite a lot, actually. Applied uncritically, the above procedure is almost 
guaranteed to produce inaccurate results. Applied properly, it can be the right way 
to compute a derivative only when the function $f$ is fiercely expensive to compute, 
when you already have invested in computing $f(x)$, and when, therefore, you want 
to get the derivative in no more than a single additional function evaluation. In such 
a situation, the remaining issue is to choose $h$ properly, an issue we now discuss: 

There are two sources of error in equation (5.7.1), truncation error and roundoff 
error. The truncation error comes from higher terms in the Taylor series expansion, 

$$f(x + h) = f(x) + hf'(x) + \frac{1}{2}h^2 f''(x) + \frac{1}{6}h^3 f'''(x) + \cdots \tag{5.7.2}$$

whence 

$$\frac{f(x + h) - f(x)}{h} = f' + \frac{1}{2}hf'' + \cdots \tag{5.7.3}$$

The roundoff error has various contributions. First there is roundoff error in $h$: 
Suppose, by way of an example, that you are at a point $x = 10.3$ and you blindly 
choose $h = 0.0001$. Neither $x = 10.3$ nor $x + h = 10.30001$ is a number with 
an exact representation in binary; each is therefore represented with some fractional 
error characteristic of the machine’s floating-point format, $\epsilon_m$, whose value in single 
precision may be $\sim 10^{-7}$. The error in the effective value of $h$, namely the difference 
between $x + h$ and $x$ as represented in the machine, is therefore on the order of $\epsilon_m x$, 
which implies a fractional error in $h$ of order $\sim \epsilon_m x/h \sim 10^{-2}$! By equation (5.7.1) 
this immediately implies at least the same large fractional error in the derivative. 

We arrive at Lesson 1: Always choose h so that x + h and x differ by an exactly 
representable number. This can usually be accomplished by the program steps 


    temp = x + h 
    h = temp - x                                              (5.7.4) 


Some optimizing compilers, and some computers whose floating-point chips have 
higher internal accuracy than is stored externally, can foil this trick; if so, it is 
usually enough to declare temp as volatile, or else to call a dummy function 
donothing(temp) between the two equations (5.7.4). This forces temp into and 
out of addressable memory. 



With h an “exact” number, the roundoff error in equation (5.7.1) is 
$\epsilon_r \sim \epsilon_f |f(x)/h|$. Here $\epsilon_f$ is the fractional accuracy with which f is computed; for a 
simple function this may be comparable to the machine accuracy, $\epsilon_f \approx \epsilon_m$, but for a 
complicated calculation with additional sources of inaccuracy it may be larger. The 
truncation error in equation (5.7.3) is on the order of $\epsilon_t \sim |h f''(x)|$. Varying h to 
minimize the sum $\epsilon_r + \epsilon_t$ gives the optimal choice of h, 

$$ h \sim \sqrt{\frac{\epsilon_f f}{f''}} \approx \sqrt{\epsilon_f}\, x_c \tag{5.7.5} $$

where $x_c \equiv (f/f'')^{1/2}$ is the “curvature scale” of the function f, or “characteristic 
scale” over which it changes. In the absence of any other information, one often 
assumes $x_c = x$ (except near x = 0 where some other estimate of the typical x 
scale should be used). 

With the choice of equation (5.7.5), the fractional accuracy of the computed 
derivative is 


$$ (\epsilon_r + \epsilon_t)/|f'| \sim \sqrt{\epsilon_f}\,(f f''/f'^2)^{1/2} \sim \sqrt{\epsilon_f} \tag{5.7.6} $$

Here the last order-of-magnitude equality assumes that f, f', and f'' all share 
the same characteristic length scale, usually the case. One sees that the simple 
finite-difference equation (5.7.1) gives at best only the square root of the machine 
accuracy $\epsilon_m$. 

If you can afford two function evaluations for each derivative calculation, then 
it is significantly better to use the symmetrized form 

$$ f'(x) \approx \frac{f(x+h) - f(x-h)}{2h} \tag{5.7.7} $$

In this case, by equation (5.7.2), the truncation error is $\epsilon_t \sim h^2 f'''$. The roundoff 
error $\epsilon_r$ is about the same as before. The optimal choice of h, by a short calculation 
analogous to the one above, is now 

$$ h \sim \left(\frac{\epsilon_f f}{f'''}\right)^{1/3} \sim (\epsilon_f)^{1/3} x_c \tag{5.7.8} $$

and the fractional error is 

$$ (\epsilon_r + \epsilon_t)/|f'| \sim (\epsilon_f)^{2/3} f^{2/3} (f''')^{1/3}/f' \sim (\epsilon_f)^{2/3} \tag{5.7.9} $$

which will typically be an order of magnitude (single precision) or two orders of 
magnitude (double precision) better than equation (5.7.6). We have arrived at Lesson 
2: Choose h to be the correct power of $\epsilon_f$ or $\epsilon_m$ times a characteristic scale $x_c$. 

You can easily derive the correct powers for other cases [1]. For a function of 
two dimensions, for example, and the mixed derivative formula 

$$ \frac{\partial^2 f}{\partial x\,\partial y} = \frac{[f(x+h,y+h) - f(x+h,y-h)] - [f(x-h,y+h) - f(x-h,y-h)]}{4h^2} \tag{5.7.10} $$

the correct scaling is $h \sim \epsilon_f^{1/4} x_c$. 

It is disappointing, certainly, that no simple finite-difference formula like 
equation (5.7.1) or (5.7.7) gives an accuracy comparable to the machine accuracy 
$\epsilon_m$, or even the lower accuracy to which f is evaluated, $\epsilon_f$. Are there no better 
methods? 

Yes, there are. All, however, involve exploration of the function’s behavior over 
scales comparable to x c , plus some assumption of smoothness, or analyticity, so that 
the high-order terms in a Taylor expansion like equation (5.7.2) have some meaning. 
Such methods also involve multiple evaluations of the function f, so their increased 
accuracy must be weighed against increased cost. 

The general idea of “Richardson’s deferred approach to the limit” is particularly 
attractive. For numerical integrals, that idea leads to so-called Romberg integration 
(for review, see §4.3). For derivatives, one seeks to extrapolate, to h → 0, the result 
of finite-difference calculations with smaller and smaller finite values of h. By the 
use of Neville’s algorithm (§3.1), one uses each new finite-difference calculation to 
produce both an extrapolation of higher order, and also extrapolations of previous, 
lower, orders but with smaller scales h. Ridders [2] has given a nice implementation 
of this idea; the following program, dfridr, is based on his algorithm, modified by 
an improved termination criterion. Input to the routine is a function f (called func), 
a position x, and a largest stepsize h (more analogous to what we have called $x_c$ 
above than to what we have called h). Output is the returned value of the derivative, 
and an estimate of its error, err. 


#include <math.h>
#include "nrutil.h"
#define CON 1.4        /* Stepsize is decreased by CON at each iteration. */
#define CON2 (CON*CON)
#define BIG 1.0e30
#define NTAB 10        /* Sets maximum size of tableau. */
#define SAFE 2.0       /* Return when error is SAFE worse than the best so far. */

float dfridr(float (*func)(float), float x, float h, float *err)
/* Returns the derivative of a function func at a point x by Ridders' method of polynomial
extrapolation. The value h is input as an estimated initial stepsize; it need not be small, but
rather should be an increment in x over which func changes substantially. An estimate of the
error in the derivative is returned as err. */
{
    int i,j;
    float errt,fac,hh,**a,ans;

    if (h == 0.0) nrerror("h must be nonzero in dfridr.");
    a=matrix(1,NTAB,1,NTAB);
    hh=h;
    a[1][1]=((*func)(x+hh)-(*func)(x-hh))/(2.0*hh);
    *err=BIG;
    for (i=2;i<=NTAB;i++) {
        /* Successive columns in the Neville tableau will go to smaller stepsizes and
           higher orders of extrapolation. */
        hh /= CON;
        a[1][i]=((*func)(x+hh)-(*func)(x-hh))/(2.0*hh);   /* Try new, smaller stepsize. */
        fac=CON2;
        for (j=2;j<=i;j++) {
            /* Compute extrapolations of various orders, requiring no new function
               evaluations. */
            a[j][i]=(a[j-1][i]*fac-a[j-1][i-1])/(fac-1.0);
            fac=CON2*fac;
            errt=FMAX(fabs(a[j][i]-a[j-1][i]),fabs(a[j][i]-a[j-1][i-1]));
            /* The error strategy is to compare each new extrapolation to one order lower,
               both at the present stepsize and the previous one. */
            if (errt <= *err) {   /* If error is decreased, save the improved answer. */
                *err=errt;
                ans=a[j][i];
            }
        }
        if (fabs(a[i][i]-a[i-1][i-1]) >= SAFE*(*err)) break;
        /* If higher order is worse by a significant factor SAFE, then quit early. */
    }
    free_matrix(a,1,NTAB,1,NTAB);
    return ans;
}

In dfridr, the number of evaluations of func is typically 6 to 12, but is allowed 
to be as great as 2×NTAB. As a function of input h, it is typical for the accuracy 
to get better as h is made larger, until a sudden point is reached where nonsensical 
extrapolation produces early return with a large error. You should therefore choose 
a fairly large value for h, but monitor the returned value err, decreasing h if it is 
not small. For functions whose characteristic x scale is of order unity, we typically 
take h to be a few tenths. 

Besides Ridders’ method, there are other possible techniques. If your function 
is fairly smooth, and you know that you will want to evaluate its derivative many 
times at arbitrary points in some interval, then it makes sense to construct a 
Chebyshev polynomial approximation to the function in that interval, and to evaluate 
the derivative directly from the resulting Chebyshev coefficients. This method is 
described in §§5.8-5.9, following. 

Another technique applies when the function consists of data that is tabulated 
at equally spaced intervals, and perhaps also noisy. One might then want, at each 
point, to least-squares fit a polynomial of some degree M, using an additional 
number $n_L$ of points to the left and some number $n_R$ of points to the right of each 
desired x value. The estimated derivative is then the derivative of the resulting 
fitted polynomial. A very efficient way to do this construction is via Savitzky-Golay 
smoothing filters, which will be discussed later, in §14.8. There we will give a 
routine for getting filter coefficients that not only construct the fitting polynomial but, 
in the accumulation of a single sum of data points times filter coefficients, evaluate 
it as well. In fact, the routine given, savgol, has an argument ld that determines 
which derivative of the fitted polynomial is evaluated. For the first derivative, the 
appropriate setting is ld=1, and the value of the derivative is the accumulated sum 
divided by the sampling interval h. 


CITED REFERENCES AND FURTHER READING: 

Dennis, J.E., and Schnabel, R.B. 1983, Numerical Methods for Unconstrained Optimization and 
Nonlinear Equations (Englewood Cliffs, NJ: Prentice-Hall), §§5.4-5.6. [1] 

Ridders, C.J.F. 1982, Advances in Engineering Software, vol. 4, no. 2, pp. 75-76. [2] 


5.8 Chebyshev Approximation 

The Chebyshev polynomial of degree n is denoted $T_n(x)$, and is given by 
the explicit formula 

$$ T_n(x) = \cos(n \arccos x) \tag{5.8.1} $$

This may look trigonometric at first glance (and there is in fact a close relation 
between the Chebyshev polynomials and the discrete Fourier transform); however 
(5.8.1) can be combined with trigonometric identities to yield explicit expressions 
for $T_n(x)$ (see Figure 5.8.1), 

$$ \begin{aligned}
T_0(x) &= 1 \\
T_1(x) &= x \\
T_2(x) &= 2x^2 - 1 \\
T_3(x) &= 4x^3 - 3x \\
T_4(x) &= 8x^4 - 8x^2 + 1 \\
&\;\;\vdots \\
T_{n+1}(x) &= 2x T_n(x) - T_{n-1}(x) \qquad n \ge 1.
\end{aligned} \tag{5.8.2} $$

(There also exist inverse formulas for the powers of x in terms of the $T_n$’s; see 
equations 5.11.2-5.11.3.) 

The Chebyshev polynomials are orthogonal in the interval [−1, 1] over a weight 
$(1 - x^2)^{-1/2}$. In particular, 

$$ \int_{-1}^{1} \frac{T_i(x)\, T_j(x)}{\sqrt{1 - x^2}}\,dx = \begin{cases} 0 & i \ne j \\ \pi/2 & i = j \ne 0 \\ \pi & i = j = 0 \end{cases} \tag{5.8.3} $$


The polynomial $T_n(x)$ has n zeros in the interval [−1, 1], and they are located 
at the points 

$$ x = \cos\left(\frac{\pi (k - \frac{1}{2})}{n}\right) \qquad k = 1, 2, \ldots, n \tag{5.8.4} $$

In this same interval there are n + 1 extrema (maxima and minima), located at 

$$ x = \cos\left(\frac{\pi k}{n}\right) \qquad k = 0, 1, \ldots, n \tag{5.8.5} $$

At all of the maxima $T_n(x) = 1$, while at all of the minima $T_n(x) = -1$; 
it is precisely this property that makes the Chebyshev polynomials so useful in 
polynomial approximation of functions. 

Figure 5.8.1. Chebyshev polynomials $T_0(x)$ through $T_6(x)$. Note that $T_j$ has j roots in the interval 
(−1, 1) and that all the polynomials are bounded between ±1. 


The Chebyshev polynomials satisfy a discrete orthogonality relation as well as 
the continuous one (5.8.3): If $x_k$ (k = 1, ..., m) are the m zeros of $T_m(x)$ given 
by (5.8.4), and if i, j < m, then 

$$ \sum_{k=1}^{m} T_i(x_k)\, T_j(x_k) = \begin{cases} 0 & i \ne j \\ m/2 & i = j \ne 0 \\ m & i = j = 0 \end{cases} \tag{5.8.6} $$


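It is not too difficult to combine equations (5.8.1), (5.8.4), and (5.8.6) to prove 
the following theorem: If f(x) is an arbitrary function in the interval [−1, 1], and if 
N coefficients $c_j$, j = 0, ..., N − 1, are defined by 

$$ c_j = \frac{2}{N} \sum_{k=1}^{N} f(x_k)\, T_j(x_k) = \frac{2}{N} \sum_{k=1}^{N} f\left[\cos\left(\frac{\pi (k - \frac{1}{2})}{N}\right)\right] \cos\left(\frac{\pi j (k - \frac{1}{2})}{N}\right) \tag{5.8.7} $$

then the approximation formula 

$$ f(x) \approx \left[\sum_{k=0}^{N-1} c_k T_k(x)\right] - \frac{1}{2} c_0 \tag{5.8.8} $$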




is exact for x equal to all of the N zeros of $T_N(x)$. 

For a fixed N, equation (5.8.8) is a polynomial in x which approximates the 
function f(x) in the interval [−1, 1] (where all the zeros of $T_N(x)$ are located). Why 
is this particular approximating polynomial better than any other one, exact on some 
other set of N points? The answer is not that (5.8.8) is necessarily more accurate 
than some other approximating polynomial of the same order N (for some specified 
definition of “accurate”), but rather that (5.8.8) can be truncated to a polynomial of 
lower degree $m \ll N$ in a very graceful way, one that does yield the “most accurate” 
approximation of degree m (in a sense that can be made precise). Suppose N is 
so large that (5.8.8) is virtually a perfect approximation of f(x). Now consider 
the truncated approximation 

$$ f(x) \approx \left[\sum_{k=0}^{m-1} c_k T_k(x)\right] - \frac{1}{2} c_0 \tag{5.8.9} $$

with the same $c_j$’s, computed from (5.8.7). Since the $T_k(x)$’s are all bounded 
between ±1, the difference between (5.8.9) and (5.8.8) can be no larger than the 
sum of the neglected $c_k$’s (k = m, ..., N − 1). In fact, if the $c_k$’s are rapidly 
decreasing (which is the typical case), then the error is dominated by $c_m T_m(x)$, 
an oscillatory function with m + 1 equal extrema distributed smoothly over the 
interval [−1, 1]. This smooth spreading out of the error is a very important property: 
The Chebyshev approximation (5.8.9) is very nearly the same polynomial as that 
holy grail of approximating polynomials, the minimax polynomial, which (among all 
polynomials of the same degree) has the smallest maximum deviation from the true 
function f(x). The minimax polynomial is very difficult to find; the Chebyshev 
approximating polynomial is almost identical and is very easy to compute! 

So, given some (perhaps difficult) means of computing the function f(x), we 
now need algorithms for implementing (5.8.7) and (after inspection of the resulting 
$c_k$’s and choice of a truncating value m) evaluating (5.8.9). The latter equation then 
becomes an easy way of computing f(x) for all subsequent time. 

The first of these tasks is straightforward. A generalization of equation (5.8.7) 
that is here implemented is to allow the range of approximation to be between two 
arbitrary limits a and b, instead of just —1 to 1. This is effected by a change of variable 


$$ y \equiv \frac{x - \frac{1}{2}(b + a)}{\frac{1}{2}(b - a)} \tag{5.8.10} $$


and by the approximation of f(x) by a Chebyshev polynomial in y. 


#include <math.h>
#include "nrutil.h"
#define PI 3.141592653589793

void chebft(float a, float b, float c[], int n, float (*func)(float))
/* Chebyshev fit: Given a function func, lower and upper limits of the interval [a,b], and a
maximum degree n, this routine computes the n coefficients c[0..n-1] such that func(x) is
approximately [sum_{k=0}^{n-1} c_k T_k(y)] - c_0/2, where y and x are related by (5.8.10).
This routine is to be used with moderately large n (e.g., 30 or 50), the array of c's
subsequently to be truncated at the smaller value m such that c_m and subsequent elements
are negligible. */
{
    int k,j;
    float fac,bpa,bma,*f;

    f=vector(0,n-1);
    bma=0.5*(b-a);
    bpa=0.5*(b+a);
    for (k=0;k<n;k++) {    /* We evaluate the function at the n points required by (5.8.7). */
        float y=cos(PI*(k+0.5)/n);
        f[k]=(*func)(y*bma+bpa);
    }
    fac=2.0/n;
    for (j=0;j<n;j++) {
        double sum=0.0;    /* We will accumulate the sum in double precision, a nicety
                              that you can ignore. */
        for (k=0;k<n;k++)
            sum += f[k]*cos(PI*j*(k+0.5)/n);
        c[j]=fac*sum;
    }
    free_vector(f,0,n-1);
}

(If you find that the execution time of chebft is dominated by the calculation of 
$N^2$ cosines, rather than by the N evaluations of your function, then you should look 
ahead to §12.3, especially equation 12.3.22, which shows how fast cosine transform 
methods can be used to evaluate equation 5.8.7.) 

Now that we have the Chebyshev coefficients, how do we evaluate the approximation? 
One could use the recurrence relation of equation (5.8.2) to generate values 
for $T_k(x)$ from $T_0 = 1$, $T_1 = x$, while also accumulating the sum of (5.8.9). It 
is better to use Clenshaw’s recurrence formula (§5.5), effecting the two processes 
simultaneously. Applied to the Chebyshev series (5.8.9), the recurrence is 

$$ \begin{aligned}
d_{m+1} &\equiv d_m \equiv 0 \\
d_j &= 2x d_{j+1} - d_{j+2} + c_j \qquad j = m-1, m-2, \ldots, 1 \\
f(x) &\equiv d_0 = x d_1 - d_2 + \frac{1}{2} c_0
\end{aligned} \tag{5.8.11} $$


float chebev(float a, float b, float c[], int m, float x)
/* Chebyshev evaluation: All arguments are input. c[0..m-1] is an array of Chebyshev
coefficients, the first m elements of c output from chebft (which must have been called with
the same a and b). The Chebyshev polynomial [sum_{k=0}^{m-1} c_k T_k(y)] - c_0/2 is evaluated
at a point y = [x - (b+a)/2]/[(b-a)/2], and the result is returned as the function value. */
{
    void nrerror(char error_text[]);
    float d=0.0,dd=0.0,sv,y,y2;
    int j;

    if ((x-a)*(x-b) > 0.0) nrerror("x not in range in routine chebev");
    y2=2.0*(y=(2.0*x-a-b)/(b-a));      /* Change of variable. */
    for (j=m-1;j>=1;j--) {             /* Clenshaw's recurrence. */
        sv=d;
        d=y2*d-dd+c[j];
        dd=sv;
    }
    return y*d-dd+0.5*c[0];            /* Last step is different. */
}
If we are approximating an even function on the interval [−1, 1], its expansion 
will involve only even Chebyshev polynomials. It is wasteful to call chebev with 
all the odd coefficients zero [1]. Instead, using the half-angle identity for the cosine 
in equation (5.8.1), we get the relation 

$$ T_{2n}(x) = T_n(2x^2 - 1) \tag{5.8.12} $$

Thus we can evaluate a series of even Chebyshev polynomials by calling chebev 
with the even coefficients stored consecutively in the array c, but with the argument 
x replaced by $2x^2 - 1$. 

An odd function will have an expansion involving only odd Chebyshev polynomials. 
It is best to rewrite it as an expansion for the function f(x)/x, which 
involves only even Chebyshev polynomials. This will give accurate values for 
f(x)/x near x = 0. The coefficients $c'_n$ for f(x)/x can be found from those for 
f(x) by recurrence: 

$$ \begin{aligned}
c'_{N+1} &= 0 \\
c'_{n-1} &= 2c_n - c'_{n+1}, \qquad n = N-1, N-3, \ldots
\end{aligned} \tag{5.8.13} $$

Equation (5.8.13) follows from the recurrence relation in equation (5.8.2). 

If you insist on evaluating an odd Chebyshev series, the efficient way is to once 
again use chebev with x replaced by $y = 2x^2 - 1$, and with the odd coefficients 
stored consecutively in the array c. Now, however, you must also change the last 
formula in equation (5.8.11) to be 

$$ f(x) = x[(2y - 1)d_1 - d_2 + c_0] \tag{5.8.14} $$

and change the corresponding line in chebev. 


CITED REFERENCES AND FURTHER READING: 

Clenshaw, C.W. 1962, Mathematical Tables, vol. 5, National Physical Laboratory, (London: H.M. 
Stationery Office). [1] 

Goodwin, E.T. (ed.) 1961, Modern Computing Methods, 2nd ed. (New York: Philosophical 
Library), Chapter 8. 

Dahlquist, G., and Bjorck, A. 1974, Numerical Methods (Englewood Cliffs, NJ: Prentice-Hall), 
§4.4.1, p. 104. 

Johnson, L.W., and Riess, R.D. 1982, Numerical Analysis, 2nd ed. (Reading, MA: 
Addison-Wesley), §6.5.2, p. 334. 

Carnahan, B., Luther, H.A., and Wilkes, J.O. 1969, Applied Numerical Methods (New York: 
Wiley), §1.10, p. 39. 


5.9 Derivatives or Integrals of a 
Chebyshev-approximated Function 


If you have obtained the Chebyshev coefficients that approximate a function in 
a certain range (e.g., from chebft in §5.8), then it is a simple matter to transform 
them to Chebyshev coefficients corresponding to the derivative or integral of the 
function. Having done this, you can evaluate the derivative or integral just as if it 
were a function that you had Chebyshev-fitted ab initio. 

The relevant formulas are these: If $c_i$, i = 0, ..., m − 1 are the coefficients 
that approximate a function f in equation (5.8.9), $C_i$ are the coefficients that 
approximate the indefinite integral of f, and $c'_i$ are the coefficients that approximate 
the derivative of f, then 

$$ C_i = \frac{c_{i-1} - c_{i+1}}{2i} \qquad (i > 0) \tag{5.9.1} $$

$$ c'_{i-1} = c'_{i+1} + 2i c_i \qquad (i = m-1, m-2, \ldots, 1) \tag{5.9.2} $$

Equation (5.9.1) is augmented by an arbitrary choice of $C_0$, corresponding to an 
arbitrary constant of integration. Equation (5.9.2), which is a recurrence, is started 
with the values $c'_m = c'_{m-1} = 0$, corresponding to no information about the m + 1st 
Chebyshev coefficient of the original function f. 

Here are routines for implementing equations (5.9.1) and (5.9.2). 

void chder(float a, float b, float c[], float cder[], int n)
/* Given a,b,c[0..n-1], as output from routine chebft §5.8, and given n, the desired degree
of approximation (length of c to be used), this routine returns the array cder[0..n-1], the
Chebyshev coefficients of the derivative of the function whose coefficients are c. */
{
    int j;
    float con;

    cder[n-1]=0.0;                         /* n-1 and n-2 are special cases. */
    cder[n-2]=2*(n-1)*c[n-1];
    for (j=n-3;j>=0;j--)
        cder[j]=cder[j+2]+2*(j+1)*c[j+1];  /* Equation (5.9.2). */
    con=2.0/(b-a);
    for (j=0;j<n;j++)                      /* Normalize to the interval b-a. */
        cder[j] *= con;
}


void chint(float a, float b, float c[], float cint[], int n)
/* Given a,b,c[0..n-1], as output from routine chebft §5.8, and given n, the desired degree
of approximation (length of c to be used), this routine returns the array cint[0..n-1], the
Chebyshev coefficients of the integral of the function whose coefficients are c. The constant
of integration is set so that the integral vanishes at a. */
{
    int j;
    float sum=0.0,fac=1.0,con;

    con=0.25*(b-a);                        /* Factor that normalizes to the interval b-a. */
    for (j=1;j<=n-2;j++) {
        cint[j]=con*(c[j-1]-c[j+1])/j;     /* Equation (5.9.1). */
        sum += fac*cint[j];                /* Accumulates the constant of integration. */
        fac = -fac;                        /* Will equal ±1. */
    }
    cint[n-1]=con*c[n-2]/(n-1);            /* Special case of (5.9.1) for n-1. */
    sum += fac*cint[n-1];
    cint[0]=2.0*sum;                       /* Set the constant of integration. */
}
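
A small worked check may help fix the conventions. The sketch below gives double-precision transcriptions of the two routines under the hypothetical names cheb_der and cheb_int (cheb_int assumes n ≥ 2), so that they can be tried on a series whose derivative and integral are known in closed form.

```c
/* Derivative coefficients, equation (5.9.2), scaled to the interval [a,b]. */
void cheb_der(double a, double b, const double c[], double cder[], int n)
{
    double con = 2.0 / (b - a);
    int j;

    cder[n-1] = 0.0;                            /* n-1 and n-2 are special cases */
    if (n >= 2) cder[n-2] = 2.0 * (n - 1) * c[n-1];
    for (j = n - 3; j >= 0; j--)
        cder[j] = cder[j+2] + 2.0 * (j + 1) * c[j+1];
    for (j = 0; j < n; j++) cder[j] *= con;
}

/* Integral coefficients, equation (5.9.1); the constant of integration is set
   so that the integral vanishes at a.  Assumes n >= 2. */
void cheb_int(double a, double b, const double c[], double cint[], int n)
{
    double con = 0.25 * (b - a), sum = 0.0, fac = 1.0;
    int j;

    for (j = 1; j <= n - 2; j++) {
        cint[j] = con * (c[j-1] - c[j+1]) / j;
        sum += fac * cint[j];                   /* accumulates the constant */
        fac = -fac;
    }
    cint[n-1] = con * c[n-2] / (n - 1);
    sum += fac * cint[n-1];
    cint[0] = 2.0 * sum;
}
```

Take the series with c = {0, 0, 1, 0} on [−1, 1], that is, $f = T_2 = 2x^2 - 1$. The derivative is $4x = 4T_1$, so cder should come out as {0, 4, 0, 0}. The integral from −1 is $\frac{2}{3}x^3 - x - \frac{1}{3}$, which in the convention of (5.8.9) corresponds to cint = {−2/3, −1/2, 0, 1/6}.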


Clenshaw-Curtis Quadrature 

Since a smooth function’s Chebyshev coefficients $c_i$ decrease rapidly, generally exponentially, 
equation (5.9.1) is often quite efficient as the basis for a quadrature scheme. The 
routines chebft and chint, used in that order, can be followed by repeated calls to chebev 
if $\int_a^x f(x)\,dx$ is required for many different values of x in the range a ≤ x ≤ b. 

If only the single definite integral $\int_a^b f(x)\,dx$ is required, then chint and chebev are 
replaced by the simpler formula, derived from equation (5.9.1), 

$$ \int_a^b f(x)\,dx = (b - a)\left[\frac{1}{2} c_0 - \frac{1}{3} c_2 - \frac{1}{15} c_4 - \cdots - \frac{1}{(2k+1)(2k-1)} c_{2k} - \cdots\right] \tag{5.9.3} $$

where the $c_i$’s are as returned by chebft. The series can be truncated when $c_{2k}$ becomes 
negligible, and the first neglected term gives an error estimate. 

This scheme is known as Clenshaw-Curtis quadrature [1]. It is often combined with an 
adaptive choice of N, the number of Chebyshev coefficients calculated via equation (5.8.7), 
which is also the number of function evaluations of f(x). If a modest choice of N does 
not give a sufficiently small $c_{2k}$ in equation (5.9.3), then a larger value is tried. In this 
adaptive case, it is even better to replace equation (5.8.7) by the so-called “trapezoidal” or 
Gauss-Lobatto (§4.5) variant, 

$$ c_j = \frac{2}{N} \sum_{k=0}^{N}{}'' f\left[\cos\left(\frac{\pi k}{N}\right)\right] \cos\left(\frac{\pi j k}{N}\right) \qquad j = 0, \ldots, N-1 \tag{5.9.4} $$

where (N.B.!) the two primes signify that the first and last terms in the sum are to be 
multiplied by 1/2. If N is doubled in equation (5.9.4), then half of the new function 
evaluation points are identical to the old ones, allowing the previous function evaluations to be 
reused. This feature, plus the analytic weights and abscissas (cosine functions in 5.9.4), give 
Clenshaw-Curtis quadrature an edge over high-order adaptive Gaussian quadrature (cf. §4.5), 
which the method otherwise resembles. 

If your problem forces you to large values of N, you should be aware that equation (5.9.4) 
can be evaluated rapidly, and simultaneously for all the values of j, by a fast cosine transform. 
(See §12.3, especially equation 12.3.17.) (We already remarked that the nontrapezoidal form 
(5.8.7) can also be done by fast cosine methods, cf. equation 12.3.22.) 


CITED REFERENCES AND FURTHER READING: 

Goodwin, E.T. (ed.) 1961, Modern Computing Methods, 2nd ed. (New York: Philosophical 
Library), pp. 78-79. 

Clenshaw, C.W., and Curtis, A.R. 1960, Numerische Mathematik, vol. 2, pp. 197-205. [1] 



5.10 Polynomial Approximation from 
Chebyshev Coefficients 

You may well ask after reading the preceding two sections, “Must I store and 
evaluate my Chebyshev approximation as an array of Chebyshev coefficients for a 
transformed variable y? Can’t I convert the $c_k$’s into actual polynomial coefficients 
in the original variable x and have an approximation of the following form?” 

$$ f(x) \approx \sum_{k=0}^{m-1} g_k x^k \tag{5.10.1} $$


Yes, you can do this (and we will give you the algorithm to do it), but we 
caution you against it: Evaluating equation (5.10.1), where the coefficient g’s reflect 
an underlying Chebyshev approximation, usually requires more significant figures 
than evaluation of the Chebyshev sum directly (as by chebev). This is because 
the Chebyshev polynomials themselves exhibit a rather delicate cancellation: The 
leading coefficient of $T_n(x)$, for example, is $2^{n-1}$; other coefficients of $T_n(x)$ are 
even bigger; yet they all manage to combine into a polynomial that lies between ±1. 
Only when m is no larger than 7 or 8 should you contemplate writing a Chebyshev 
fit as a direct polynomial, and even in those cases you should be willing to tolerate 
two or so significant figures less accuracy than the roundoff limit of your machine. 

You get the g’s in equation (5.10.1) from the c’s output from chebft (suitably 
truncated at a modest value of m) by calling in sequence the following two procedures: 


#include "nrutil.h"

void chebpc(float c[], float d[], int n)
/* Chebyshev polynomial coefficients. Given a coefficient array c[0..n-1], this routine
generates a coefficient array d[0..n-1] such that
sum_{k=0}^{n-1} d_k y^k = [sum_{k=0}^{n-1} c_k T_k(y)] - c_0/2. The method is Clenshaw's
recurrence (5.8.11), but now applied algebraically rather than arithmetically. */
{
    int k,j;
    float sv,*dd;

    dd=vector(0,n-1);
    for (j=0;j<n;j++) d[j]=dd[j]=0.0;
    d[0]=c[n-1];
    for (j=n-2;j>=1;j--) {
        for (k=n-j;k>=1;k--) {
            sv=d[k];
            d[k]=2.0*d[k-1]-dd[k];
            dd[k]=sv;
        }
        sv=d[0];
        d[0] = -dd[0]+c[j];
        dd[0]=sv;
    }
    for (j=n-1;j>=1;j--)
        d[j]=d[j-1]-dd[j];
    d[0] = -dd[0]+0.5*c[0];
    free_vector(dd,0,n-1);
}


void pcshft(float a, float b, float d[], int n)
/* Polynomial coefficient shift. Given a coefficient array d[0..n-1], this routine generates a
coefficient array g[0..n-1] such that sum_{k=0}^{n-1} d_k y^k = sum_{k=0}^{n-1} g_k x^k,
where x and y are related by (5.8.10), i.e., the interval -1 <= y <= 1 is mapped to the
interval a <= x <= b. The array g is returned in d. */
{
    int k,j;
    float fac,cnst;

    cnst=2.0/(b-a);
    fac=cnst;
    for (j=1;j<n;j++) {      /* First we rescale by the factor cnst... */
        d[j] *= fac;
        fac *= cnst;
    }
    cnst=0.5*(a+b);          /* ...which is then redefined as the desired shift. */
    for (j=0;j<=n-2;j++)     /* We accomplish the shift by synthetic division. Synthetic */
        for (k=n-2;k>=j;k--) /* division is a miracle of high-school algebra. If you */
            d[k] -= cnst*d[k+1];   /* never learned it, go do so. You won't be sorry. */
}



CITED REFERENCES AND FURTHER READING: 

Acton, F.S. 1970, Numerical Methods That Work; 1990, corrected edition (Washington: 
Mathematical Association of America), pp. 59, 182-183 [synthetic division]. 



5.11 Economization of Power Series 


One particular application of Chebyshev methods, the economization of power series, is 
an occasionally useful technique, with a flavor of getting something for nothing. 

Suppose that you are already computing a function by the use of a convergent power 
series, for example 

$$ f(x) \equiv 1 - \frac{x}{3!} + \frac{x^2}{5!} - \frac{x^3}{7!} + \cdots \tag{5.11.1} $$

(This function is actually $\sin(\sqrt{x})/\sqrt{x}$, but pretend you don’t know that.) You might be 
doing a problem that requires evaluating the series many times in some particular interval, say 
$[0, (2\pi)^2]$. Everything is fine, except that the series requires a large number of terms before 
its error (approximated by the first neglected term, say) is tolerable. In our example, with 
$x = (2\pi)^2$, the first term smaller than $10^{-7}$ is $x^{13}/(27!)$. This then approximates the error 
of the finite series whose last term is $x^{12}/(25!)$. 

Notice that because of the large exponent in $x^{13}$, the error is much smaller than $10^{-7}$ 
everywhere in the interval except at the very largest values of x. This is the feature that allows 
“economization”: if we are willing to let the error elsewhere in the interval rise to about the 
same value that the first neglected term has at the extreme end of the interval, then we can 
replace the 13-term series by one that is significantly shorter. 

Here are the steps for doing so: 

1. Change variables from x to y, as in equation (5.8.10), to map the x interval into 
−1 ≤ y ≤ 1. 

2. Find the coefficients of the Chebyshev sum (like equation 5.8.8) that exactly equals your 
truncated power series (the one with enough terms for accuracy). 

3. Truncate this Chebyshev series to a smaller number of terms, using the coefficient of the 
first neglected Chebyshev polynomial as an estimate of the error. 

4. Convert back to a polynomial in y. 

5. Change variables back to x. 

All of these steps can be done numerically, given the coefficients of the original power 
series expansion. The first step is exactly the inverse of the routine pcshft (§5.10), which 
mapped a polynomial from y (in the interval [−1, 1]) to x (in the interval [a, b]). But since 
equation (5.8.10) is a linear relation between x and y, one can also use pcshft for the 
inverse. The inverse of 

    pcshft(a,b,d,n) 

turns out to be (you can check this) 

    pcshft((-2-b-a)/(b-a), (2-b-a)/(b-a), d, n) 
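
The check can also be carried out numerically. The sketch below is a double-precision transcription of pcshft under the hypothetical name poly_shift; applying it once with (a, b) and then again with the transformed limits should return the original coefficients to within roundoff.

```c
/* On input d[0..n-1] are coefficients of a polynomial in y; on output they are
   the coefficients of the same polynomial in x, where x and y are related by
   equation (5.8.10).  Double-precision sketch of pcshft. */
void poly_shift(double a, double b, double d[], int n)
{
    double cnst = 2.0 / (b - a), fac = cnst;
    int j, k;

    for (j = 1; j < n; j++) {           /* rescale by powers of 2/(b-a) */
        d[j] *= fac;
        fac *= cnst;
    }
    cnst = 0.5 * (a + b);               /* now the shift */
    for (j = 0; j <= n - 2; j++)        /* accomplished by synthetic division */
        for (k = n - 2; k >= j; k--)
            d[k] -= cnst * d[k+1];
}
```

As a worked case, take $1 + 2y + 3y^2$ with a = 1, b = 5, so that y = (x − 3)/2. The forward call gives $4.75 - 3.5x + 0.75x^2$, and the second call, with limits (−2−b−a)/(b−a) = −2 and (2−b−a)/(b−a) = −1, recovers the coefficients 1, 2, 3.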



The second step requires the inverse operation to that done by the routine chebpc (which 
took Chebyshev coefficients into polynomial coefficients). The following routine, pccheb, 
accomplishes this, using the formula [1] 

$$ x^k = \frac{1}{2^{k-1}} \left[ T_k(x) + \binom{k}{1} T_{k-2}(x) + \binom{k}{2} T_{k-4}(x) + \cdots \right] \tag{5.11.2} $$

where the last term depends on whether k is even or odd, 

$$ \cdots + \binom{k}{(k-1)/2} T_1(x) \quad (k \text{ odd}), \qquad \cdots + \frac{1}{2}\binom{k}{k/2} T_0(x) \quad (k \text{ even}). \tag{5.11.3} $$


void pccheb(float d[], float c[], int n)
/* Inverse of routine chebpc: given an array of polynomial coefficients d[0..n-1], returns an
equivalent array of Chebyshev coefficients c[0..n-1]. */
{
    int j,jm,jp,k;
    float fac,pow;

    pow=1.0;                  /* Will be powers of 2. */
    c[0]=2.0*d[0];
    for (k=1;k<n;k++) {       /* Loop over orders of x in the polynomial. */
        c[k]=0.0;             /* Zero corresponding order of Chebyshev. */
        fac=d[k]/pow;
        jm=k;
        jp=1;
        for (j=k;j>=0;j-=2,jm--,jp++) {
            /* Increment this and lower orders of Chebyshev with the combinatorial
               coefficient times d[k]; see text for formula. */
            c[j] += fac;
            fac *= ((float)jm)/((float)jp);
        }
        pow += pow;
    }
}

The fourth and fifth steps are accomplished by the routines chebpc and pcshft, 
respectively. Here is how the procedure looks all together: 

#define NFEW ..
#define NMANY ..
float *c,*d,*e,a,b;
/* Economize NMANY power series coefficients e[0..NMANY-1] in the range (a,b) into NFEW
   coefficients d[0..NFEW-1]. */
c=vector(0,NMANY-1);
d=vector(0,NFEW-1);
e=vector(0,NMANY-1);
pcshft((-2.0-b-a)/(b-a),(2.0-b-a)/(b-a),e,NMANY);
pccheb(e,c,NMANY);
/* Here one would normally examine the Chebyshev coefficients c[0..NMANY-1] to decide
   how small NFEW can be. */
chebpc(c,d,NFEW);
pcshft(a,b,d,NFEW);

In our example, by the way, the 8th through 10th Chebyshev coefficients turn out to 
be on the order of $-7 \times 10^{-6}$, $3 \times 10^{-7}$, and $-9 \times 10^{-9}$, so reasonable truncations (for 
single precision calculations) are somewhere in this range, yielding a polynomial with 8 to 
10 terms instead of the original 13. 

Replacing a 13-term polynomial with a (say) 10-term polynomial without any loss of 
accuracy — that does seem to be getting something for nothing. Is there some magic in 
this technique? Not really. The 13-term polynomial defined a function f(x). Equivalent to 
economizing the series, we could instead have evaluated f(x) at enough points to construct 
its Chebyshev approximation in the interval of interest, by the methods of §5.8. We would 
have obtained just the same lower-order polynomial. The principal lesson is that the rate 
of convergence of Chebyshev coefficients has nothing to do with the rate of convergence of 
power series coefficients; and it is the former that dictates the number of terms needed in a 
polynomial approximation. A function might have a divergent power series in some region 
of interest, but if the function itself is well-behaved, it will have perfectly good polynomial 
approximations. These can be found by the methods of §5.8, but not by economization of 
series. There is slightly less to economization of series than meets the eye. 


CITED REFERENCES AND FURTHER READING: 

Acton, F.S. 1970, Numerical Methods That Work; 1990, corrected edition (Washington:
Mathematical Association of America), Chapter 12.

Arfken, G. 1970, Mathematical Methods for Physicists, 2nd ed. (New York: Academic Press), 
p. 631. [1] 


5.12 Padé Approximants

A Padé approximant, so called, is that rational function (of a specified order) whose
power series expansion agrees with a given power series to the highest possible order. If
the rational function is

$$ R(x) \equiv \frac{\sum_{k=0}^{M} a_k x^k}{1 + \sum_{k=1}^{N} b_k x^k} \tag{5.12.1} $$

then R(x) is said to be a Padé approximant to the series

$$ f(x) \equiv \sum_{k=0}^{\infty} c_k x^k \tag{5.12.2} $$

if

$$ R(0) = f(0) \tag{5.12.3} $$

and also

$$ \left. \frac{d^k}{dx^k} R(x) \right|_{x=0} = \left. \frac{d^k}{dx^k} f(x) \right|_{x=0}, \qquad k = 1, 2, \ldots, M+N \tag{5.12.4} $$

Equations (5.12.3) and (5.12.4) furnish M + N + 1 equations for the unknowns $a_0, \ldots, a_M$
and $b_1, \ldots, b_N$. The easiest way to see what these equations are is to equate (5.12.1) and
(5.12.2), multiply both by the denominator of equation (5.12.1), and equate all powers of
x that have either a’s or b’s in their coefficients. If we consider only the special case of
a diagonal rational approximation, M = N (cf. §3.2), then we have $a_0 = c_0$, with the
remaining a’s and b’s satisfying

$$ \sum_{m=1}^{N} b_m c_{N-m+k} = -c_{N+k}, \qquad k = 1, \ldots, N \tag{5.12.5} $$

$$ \sum_{m=0}^{k} b_m c_{k-m} = a_k, \qquad k = 1, \ldots, N \tag{5.12.6} $$

(note, in equation 5.12.1, that $b_0 = 1$). To solve these, start with equations (5.12.5), which
are a set of linear equations for all the unknown b’s. Although the set is in the form of a
Toeplitz matrix (compare equation 2.8.8), experience shows that the equations are frequently
close to singular, so that one should not solve them by the methods of §2.8, but rather by
full LU decomposition. Additionally, it is a good idea to refine the solution by iterative
improvement (routine mprove in §2.5) [1].

Once the b’s are known, then equation (5.12.6) gives an explicit formula for the unknown 
a’s, completing the solution. 
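
As a small check on equations (5.12.5) and (5.12.6), it may help to write them out for the lowest diagonal case, M = N = 1. The exponential series used to put numbers in is our own illustrative choice, not an example from the text.

$$ b_1 c_1 = -c_2 \;\Rightarrow\; b_1 = -\frac{c_2}{c_1}, \qquad a_0 = c_0, \qquad a_1 = c_1 + b_1 c_0 $$

For $e^x$ we have $c_0 = 1$, $c_1 = 1$, $c_2 = 1/2$, so that $b_1 = -1/2$ and $a_1 = 1/2$, giving

$$ R(x) = \frac{1 + x/2}{1 - x/2} = 1 + x + \frac{x^2}{2} + \frac{x^3}{4} + \cdots $$

which agrees with the series for $e^x$ through the $x^2$ term, that is, through order M + N = 2, as equations (5.12.3) and (5.12.4) require.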

Padé approximants are typically used when there is some unknown underlying function
f(x). We suppose that you are able somehow to compute, perhaps by laborious analytic
expansions, the values of f(x) and a few of its derivatives at x = 0: f(0), f'(0), f''(0),
and so on. These are of course the first few coefficients in the power series expansion of
f(x); but they are not necessarily getting small, and you have no idea where (or whether)
the power series is convergent.

By contrast with techniques like Chebyshev approximation (§5.8) or economization
of power series (§5.11) that only condense the information that you already know about a
function, Padé approximants can give you genuinely new information about your function’s
values. It is sometimes quite mysterious how well this can work. (Like other mysteries in
mathematics, it relates to analyticity.) An example will illustrate.

Imagine that, by extraordinary labors, you have ground out the first five terms in the
power series expansion of an unknown function f(x),

$$ f(x) \approx 2 + \frac{1}{9} x + \frac{1}{81} x^2 - \frac{49}{8748} x^3 + \frac{175}{78732} x^4 + \cdots \tag{5.12.7} $$

(It is not really necessary that you know the coefficients in exact rational form; numerical
values are just as good. We here write them as rationals to give you the impression that
they derive from some side analytic calculation.) Equation (5.12.7) is plotted as the curve
labeled “power series” in Figure 5.12.1. One sees that for $x \gtrsim 4$ it is dominated by its
largest, quartic, term.

We now take the five coefficients in equation (5.12.7) and run them through the routine
pade listed below. It returns five rational coefficients, three a’s and two b’s, for use in equation
(5.12.1) with M = N = 2. The curve in the figure labeled “Padé” plots the resulting rational
function. Note that both solid curves derive from the same five original coefficient values.



Figure 5.12.1. The five-term power series expansion and the derived five-coefficient Padé approximant
for a sample function f(x). The full power series converges only for x < 1. Note that the Padé
approximant maintains accuracy far outside the radius of convergence of the series.

To evaluate the results, we need Deus ex machina (a useful fellow, when he is available)
to tell us that equation (5.12.7) is in fact the power series expansion of the function

$$ f(x) = [7 + (1 + x)^{4/3}]^{1/3} \tag{5.12.8} $$

which is plotted as the dotted curve in the figure. This function has a branch point at x = -1,
so its power series is convergent only in the range -1 < x < 1. In most of the range
shown in the figure, the series is divergent, and the value of its truncation to five terms is
rather meaningless. Nevertheless, those five terms, converted to a Padé approximant, give a
remarkably good representation of the function up to at least $x \sim 10$.

Why does this work? Are there not other functions with the same first five terms in
their power series, but completely different behavior in the range (say) 2 < x < 10? Indeed
there are. Padé approximation has the uncanny knack of picking the function you had in
mind from among all the possibilities. Except when it doesn’t! That is the downside of
Padé approximation: it is uncontrolled. There is, in general, no way to tell how accurate
it is, or how far out in x it can usefully be extended. It is a powerful, but in the end still
mysterious, technique.

Here is the routine that gets a’s and b’s from your c’s. Note that the routine is specialized 
to the case M = N, and also that, on output, the rational coefficients are arranged in a format 
for use with the evaluation routine ratval (§5.3). (Also for consistency with that routine, 
the array of c’s is passed in double precision.) 

#include <math.h>
#include "nrutil.h"
#define BIG 1.0e30

void pade(double cof[], int n, float *resid)
/* Given cof[0..2*n], the leading terms in the power series expansion of a function, solve the
linear Padé equations to return the coefficients of a diagonal rational function approximation to
the same function, namely (cof[0] + cof[1]x + ... + cof[n]x^N)/(1 + cof[n+1]x + ... +
cof[2*n]x^N). The value resid is the norm of the residual vector; a small value indicates a
well-converged solution. Note that cof is double precision for consistency with ratval. */
{
    void lubksb(float **a, int n, int *indx, float b[]);
    void ludcmp(float **a, int n, int *indx, float *d);
    void mprove(float **a, float **alud, int n, int indx[], float b[],
        float x[]);
    int j,k,*indx;
    float d,rr,rrold,sum,**q,**qlu,*x,*y,*z;

    indx=ivector(1,n);
    q=matrix(1,n,1,n);
    qlu=matrix(1,n,1,n);
    x=vector(1,n);
    y=vector(1,n);
    z=vector(1,n);
    for (j=1;j<=n;j++) {                  /* Set up matrix for solving. */
        y[j]=x[j]=cof[n+j];
        for (k=1;k<=n;k++) {
            q[j][k]=cof[j-k+n];
            qlu[j][k]=q[j][k];
        }
    }
    ludcmp(qlu,n,indx,&d);                /* Solve by LU decomposition and backsubstitution. */
    lubksb(qlu,n,indx,x);
    rr=BIG;
    do {                                  /* Important to use iterative improvement, since
                                             the Padé equations tend to be ill-conditioned. */
        rrold=rr;
        for (j=1;j<=n;j++) z[j]=x[j];
        mprove(q,qlu,n,indx,y,x);
        for (rr=0.0,j=1;j<=n;j++)         /* Calculate residual. */
            rr += SQR(z[j]-x[j]);
    } while (rr < rrold);                 /* If it is no longer improving, call it quits. */
    *resid=sqrt(rrold);
    for (k=1;k<=n;k++) {                  /* Calculate the remaining coefficients. */
        for (sum=cof[k],j=1;j<=k;j++) sum -= z[j]*cof[k-j];
        y[k]=sum;
    }
    for (j=1;j<=n;j++) {                  /* Copy answers to output. */
        cof[j]=y[j];
        cof[j+n] = -z[j];
    }
    free_vector(z,1,n);
    free_vector(y,1,n);
    free_vector(x,1,n);
    free_matrix(qlu,1,n,1,n);
    free_matrix(q,1,n,1,n);
    free_ivector(indx,1,n);
}


CITED REFERENCES AND FURTHER READING: 

Ralston, A. and Wilf, H.S. 1960, Mathematical Methods for Digital Computers (New York: Wiley), 
p. 14. 

Cuyt, A., and Wuytack, L. 1987, Nonlinear Methods in Numerical Analysis (Amsterdam: North- 
Holland), Chapter 2. 

Graves-Morris, P.R. 1979, in Padé Approximation and Its Applications, Lecture Notes in
Mathematics, vol. 765, L. Wuytack, ed. (Berlin: Springer-Verlag). [1]


5.13 Rational Chebyshev Approximation


In §5.8 and §5.10 we learned how to find good polynomial approximations to a given 
function f(x) in a given interval a < x < b. Here, we want to generalize the task to find 
good approximations that are rational functions (see §5.3). The reason for doing so is that, 
for some functions and some intervals, the optimal rational function approximation is able 
to achieve substantially higher accuracy than the optimal polynomial approximation with the 
same number of coefficients. This must be weighed against the fact that finding a rational 
function approximation is not as straightforward as finding a polynomial approximation, 
which, as we saw, could be done elegantly via Chebyshev polynomials. 

Let the desired rational function R(x) have numerator of degree m and denominator
of degree k. Then we have

$$ R(x) \equiv \frac{p_0 + p_1 x + \cdots + p_m x^m}{1 + q_1 x + \cdots + q_k x^k} \approx f(x) \qquad \text{for } a < x < b \tag{5.13.1} $$

The unknown quantities that we need to find are $p_0, \ldots, p_m$ and $q_1, \ldots, q_k$, that is, m + k + 1
quantities in all. Let r(x) denote the deviation of R(x) from f(x), and let r denote its
maximum absolute value,

$$ r(x) \equiv R(x) - f(x) \qquad r \equiv \max |r(x)| \tag{5.13.2} $$


The ideal minimax solution would be that choice of p’s and q’s that minimizes r. Obviously 
there is some minimax solution, since r is bounded below by zero. How can we find it, or 
a reasonable approximation to it? 

A first hint is furnished by the following fundamental theorem: If R(x) is nondegenerate
(has no common polynomial factors in numerator and denominator), then there is a unique
choice of p’s and q’s that minimizes r; for this choice, r(x) has m + k + 2 extrema in
a < x < b, all of magnitude r and with alternating sign. (We have omitted some technical
assumptions in this theorem. See Ralston [1] for a precise statement.) We thus learn that the
situation with rational functions is quite analogous to that for minimax polynomials: In §5.8
we saw that the error term of an nth order approximation, with n + 1 Chebyshev coefficients,
was generally dominated by the first neglected Chebyshev term, namely $T_{n+1}$, which itself
has n + 2 extrema of equal magnitude and alternating sign. So, here, the number of rational
coefficients, m + k + 1, plays the same role as the number of polynomial coefficients, n + 1.

A different way to see why r(x) should have m + k + 2 extrema is to note that R(x)
can be made exactly equal to f(x) at any m + k + 1 points $x_i$. Multiplying equation (5.13.1)
by its denominator gives the equations

$$ p_0 + p_1 x_i + \cdots + p_m x_i^m = f(x_i)(1 + q_1 x_i + \cdots + q_k x_i^k), \qquad i = 1, 2, \ldots, m+k+1 \tag{5.13.3} $$

This is a set of m + k + 1 linear equations for the unknown p’s and q’s, which can be
solved by standard methods (e.g., LU decomposition). If we choose the $x_i$’s to all be in
the interval (a, b), then there will generically be an extremum between each chosen $x_i$ and
$x_{i+1}$, plus also extrema where the function goes out of the interval at a and b, for a total
of m + k + 2 extrema. For arbitrary $x_i$’s, the extrema will not have the same magnitude.
The theorem says that, for one particular choice of $x_i$’s, the magnitudes can be beaten down
to the identical, minimal, value of r.
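
A minimal sketch of equation (5.13.3) in action follows. It moves the q terms to the left-hand side so that the unknowns are $(p_0, \ldots, p_m, q_1, \ldots, q_k)$, and solves the resulting square system by plain Gaussian elimination with partial pivoting (standing in here for the LU routines of Chapter 2). The names rational_interp and gauss_solve and the size limit MAXC are our own, for illustration only.

```c
#include <math.h>

#define MAXC 16   /* largest m + k + 1 handled by this sketch */

/* Solve the n x n system A*x = b by Gaussian elimination with partial
   pivoting; the solution overwrites b. Returns -1 if a zero pivot is met. */
static int gauss_solve(int n, double A[MAXC][MAXC], double b[MAXC])
{
    for (int col = 0; col < n; col++) {
        int piv = col;
        for (int r = col + 1; r < n; r++)
            if (fabs(A[r][col]) > fabs(A[piv][col])) piv = r;
        if (A[piv][col] == 0.0) return -1;
        if (piv != col) {
            for (int c = 0; c < n; c++) {
                double t = A[col][c]; A[col][c] = A[piv][c]; A[piv][c] = t;
            }
            double t = b[col]; b[col] = b[piv]; b[piv] = t;
        }
        for (int r = col + 1; r < n; r++) {
            double fac = A[r][col] / A[col][col];
            for (int c = col; c < n; c++) A[r][c] -= fac * A[col][c];
            b[r] -= fac * b[col];
        }
    }
    for (int r = n - 1; r >= 0; r--) {
        double s = b[r];
        for (int c = r + 1; c < n; c++) s -= A[r][c] * b[c];
        b[r] = s / A[r][r];
    }
    return 0;
}

/* Given m + k + 1 abscissas x[] and function values f[], find p[0..m] and
   q[0..k] (with q[0] = 1) such that R(x[i]) = f[i], per equation (5.13.3). */
int rational_interp(int m, int k, const double x[], const double f[],
                    double p[], double q[])
{
    int n = m + k + 1;
    double A[MAXC][MAXC], rhs[MAXC];
    if (n > MAXC) return -1;
    for (int i = 0; i < n; i++) {
        double pw = 1.0;
        for (int j = 0; j <= m; j++) { A[i][j] = pw; pw *= x[i]; }
        pw = 1.0;
        for (int j = 1; j <= k; j++) { pw *= x[i]; A[i][m + j] = -f[i] * pw; }
        rhs[i] = f[i];
    }
    if (gauss_solve(n, A, rhs) != 0) return -1;
    for (int j = 0; j <= m; j++) p[j] = rhs[j];
    q[0] = 1.0;
    for (int j = 1; j <= k; j++) q[j] = rhs[m + j];
    return 0;
}
```

If f is itself a rational function of the chosen orders, for instance (1 + 2x)/(1 + 3x) sampled at three points with m = k = 1, the routine should simply recover its coefficients.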

Instead of making $f(x_i)$ and $R(x_i)$ equal at the points $x_i$, one can instead force the
residual $r(x_i)$ to any desired values $y_i$ by solving the linear equations

$$ p_0 + p_1 x_i + \cdots + p_m x_i^m = [f(x_i) - y_i](1 + q_1 x_i + \cdots + q_k x_i^k), \qquad i = 1, 2, \ldots, m+k+1 \tag{5.13.4} $$

In fact, if the $x_i$’s are chosen to be the extrema (not the zeros) of the minimax solution,
then the equations satisfied will be

$$ p_0 + p_1 x_i + \cdots + p_m x_i^m = [f(x_i) \pm r](1 + q_1 x_i + \cdots + q_k x_i^k), \qquad i = 1, 2, \ldots, m+k+2 \tag{5.13.5} $$

where the ± alternates for the alternating extrema. Notice that equation (5.13.5) is satisfied at
m + k + 2 extrema, while equation (5.13.4) was satisfied only at m + k + 1 arbitrary points.
How can this be? The answer is that r in equation (5.13.5) is an additional unknown, so that
the number of both equations and unknowns is m + k + 2. True, the set is mildly nonlinear
(in r), but in general it is still perfectly soluble by methods that we will develop in Chapter 9.

We thus see that, given only the locations of the extrema of the minimax rational
function, we can solve for its coefficients and maximum deviation. Additional theorems,
leading up to the so-called Remes algorithms [1], tell how to converge to these locations by
an iterative process. For example, here is a (slightly simplified) statement of Remes’ Second
Algorithm: (1) Find an initial rational function with m + k + 2 extrema $x_i$ (not having equal
deviation). (2) Solve equation (5.13.5) for new rational coefficients and r. (3) Evaluate the
resulting R(x) to find its actual extrema (which will not be the same as the guessed values).
(4) Replace each guessed value with the nearest actual extremum of the same sign. (5) Go
back to step 2 and iterate to convergence. Under a broad set of assumptions, this method will
converge. Ralston [1] fills in the necessary details, including how to find the initial set of $x_i$’s.

Up to this point, our discussion has been textbook-standard. We now reveal ourselves 
as heretics. We don’t much like the elegant Remes algorithm. Its two nested iterations (on
r in the nonlinear set 5.13.5, and on the new sets of $x_i$’s) are finicky and require a lot of
special logic for degenerate cases. Even more heretical, we doubt that compulsive searching 
for the exactly best, equal deviation, approximation is worth the effort — except perhaps for 
those few people in the world whose business it is to find optimal approximations that get 
built into compilers and microchips. 

When we use rational function approximation, the goal is usually much more pragmatic: 
Inside some inner loop we are evaluating some function a zillion times, and we want to 
speed up its evaluation. Almost never do we need this function to the last bit of machine 
accuracy. Suppose (heresy!) we use an approximation whose error has m + k + 2 extrema 
whose deviations differ by a factor of 2. The theorems on which the Remes algorithms
are based guarantee that the perfect minimax solution will have extrema somewhere within
this factor of 2 range: forcing down the higher extrema will cause the lower ones to rise,
until all are equal. So our “sloppy” approximation is in fact within a fraction of a least
significant bit of the minimax one.

That is good enough for us, especially when we have available a very robust method 
for finding the so-called “sloppy” approximation. Such a method is the least-squares solution 
of overdetermined linear equations by singular value decomposition (§2.6 and §15.4). We 
proceed as follows: First, solve (in the least-squares sense) equation (5.13.3), not just for
m + k + 1 values of $x_i$, but for a significantly larger number of $x_i$’s, spaced approximately
like the zeros of a high-order Chebyshev polynomial. This gives an initial guess for R(x).
Second, tabulate the resulting deviations, find the mean absolute deviation, call it r, and then
solve (again in the least-squares sense) equation (5.13.5) with r fixed and the ± chosen to be
the sign of the observed deviation at each point $x_i$. Third, repeat the second step a few times.

You can spot some Remes orthodoxy lurking in our algorithm: The equations we solve 
are trying to bring the deviations not to zero, but rather to plus-or-minus some consistent 
value. However, we dispense with keeping track of actual extrema; and we solve only linear 
equations at each stage. One additional trick is to solve a weighted least-squares problem, 
where the weights are chosen to beat down the largest deviations fastest. 

Here is a program implementing these ideas. Notice that the only calls to the function
fn occur in the initial filling of the table fs. You could easily modify the code to do this filling
outside of the routine. It is not even necessary that your abscissas xs be exactly the ones that we
use, though the quality of the fit will deteriorate if you do not have several abscissas between
each extremum of the (underlying) minimax solution. Notice that the rational coefficients are
output in a format suitable for evaluation by the routine ratval in §5.3.

Figure 5.13.1. Solid curves show deviations r(x) for five successive iterations of the routine ratlsq
for an arbitrary test problem. The algorithm does not converge to exactly the minimax solution (shown
as the dotted curve). But, after one iteration, the discrepancy is a small fraction of the last significant
bit of accuracy.


#include <stdio.h>
#include <math.h>
#include "nrutil.h"
#define NPFAC 8
#define MAXIT 5
#define PIO2 (3.141592653589793/2.0)
#define BIG 1.0e30

void ratlsq(double (*fn)(double), double a, double b, int mm, int kk,
    double cof[], double *dev)
/* Returns in cof[0..mm+kk] the coefficients of a rational function approximation to the function
fn in the interval (a,b). Input quantities mm and kk specify the order of the numerator and
denominator, respectively. The maximum absolute deviation of the approximation (insofar as
is known) is returned as dev. */
{
    double ratval(double x, double cof[], int mm, int kk);
    void dsvbksb(double **u, double w[], double **v, int m, int n, double b[],
        double x[]);
    void dsvdcmp(double **a, int m, int n, double w[], double **v);
    /* These are double versions of svdcmp, svbksb. */
    int i,it,j,ncof,npt;
    double devmax,e,hth,power,sum,*bb,*coff,*ee,*fs,**u,**v,*w,*wt,*xs;

    ncof=mm+kk+1;
    npt=NPFAC*ncof;                       /* Number of points where function is evaluated,
                                             i.e., fineness of the mesh. */
    bb=dvector(1,npt);
    coff=dvector(0,ncof-1);
    ee=dvector(1,npt);
    fs=dvector(1,npt);
    u=dmatrix(1,npt,1,ncof);
    v=dmatrix(1,ncof,1,ncof);
    w=dvector(1,ncof);
    wt=dvector(1,npt);
    xs=dvector(1,npt);
    *dev=BIG;
    for (i=1;i<=npt;i++) {                /* Fill arrays with mesh abscissas and function values. */
        if (i < npt/2) {
            hth=PIO2*(i-1)/(npt-1.0);     /* At each end, use formula that minimizes roundoff
                                             sensitivity. */
            xs[i]=a+(b-a)*DSQR(sin(hth));
        } else {
            hth=PIO2*(npt-i)/(npt-1.0);
            xs[i]=b-(b-a)*DSQR(sin(hth));
        }
        fs[i]=(*fn)(xs[i]);
        wt[i]=1.0;                        /* In later iterations we will adjust these weights to
                                             combat the largest deviations. */
        ee[i]=1.0;
    }
    e=0.0;
    for (it=1;it<=MAXIT;it++) {           /* Loop over iterations. */
        for (i=1;i<=npt;i++) {            /* Set up the "design matrix" for the least-squares fit. */
            power=wt[i];
            bb[i]=power*(fs[i]+SIGN(e,ee[i]));
            /* Key idea here: Fit to fn(x) + e where the deviation is positive, to fn(x) - e
               where it is negative. Then e is supposed to become an approximation to the
               equal-ripple deviation. */
            for (j=1;j<=mm+1;j++) {
                u[i][j]=power;
                power *= xs[i];
            }
            power = -bb[i];
            for (j=mm+2;j<=ncof;j++) {
                power *= xs[i];
                u[i][j]=power;
            }
        }
        dsvdcmp(u,npt,ncof,w,v);          /* Singular Value Decomposition. */
        /* In especially singular or difficult cases, one might here edit the singular values
           w[1..ncof], replacing small values by zero. Note that dsvbksb works with one-based
           arrays, so we must subtract 1 when we pass it the zero-based array coff. */
        dsvbksb(u,w,v,npt,ncof,bb,coff-1);
        devmax=sum=0.0;
        for (j=1;j<=npt;j++) {            /* Tabulate the deviations and revise the weights. */
            ee[j]=ratval(xs[j],coff,mm,kk)-fs[j];
            wt[j]=fabs(ee[j]);            /* Use weighting to emphasize most deviant points. */
            sum += wt[j];
            if (wt[j] > devmax) devmax=wt[j];
        }
        e=sum/npt;                        /* Update e to be the mean absolute deviation. */
        if (devmax <= *dev) {             /* Save only the best coefficient set found. */
            for (j=0;j<ncof;j++) cof[j]=coff[j];
            *dev=devmax;
        }
        printf(" ratlsq iteration= %2d max error= %10.3e\n",it,devmax);
    }
    free_dvector(xs,1,npt);
    free_dvector(wt,1,npt);
    free_dvector(w,1,ncof);
    free_dmatrix(v,1,ncof,1,ncof);
    free_dmatrix(u,1,npt,1,ncof);
    free_dvector(fs,1,npt);
    free_dvector(ee,1,npt);
    free_dvector(coff,0,ncof-1);
    free_dvector(bb,1,npt);
}

Figure 5.13.1 shows the discrepancies for the first five iterations of ratlsq when it is
applied to find the m = k rational fit to the function $f(x) = \cos x/(1 + e^x)$ in the
interval $(0, \pi)$. One sees that after the first iteration, the results are virtually as good as the
minimax solution. The iterations do not converge in the order that the figure suggests: In
fact, it is the second iteration that is best (has smallest maximum deviation). The routine
ratlsq accordingly returns the best of its iterations, not necessarily the last one; there is no
advantage in doing more than five iterations.
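
For readers without §5.3 at hand, the coefficient layout returned by ratlsq (and by pade in §5.12) can be evaluated along the following lines. This is a sketch modeled on the layout described in the header comments above, with numerator coefficients in cof[0..mm] and denominator coefficients, leading 1 omitted, in cof[mm+1..mm+kk]; the name rat_eval is ours, and the actual ratval of §5.3 may differ in detail.

```c
/* Evaluate (cof[0] + cof[1]*x + ... + cof[mm]*x^mm) /
            (1 + cof[mm+1]*x + ... + cof[mm+kk]*x^kk)
   using Horner's rule for both numerator and denominator. */
double rat_eval(double x, const double cof[], int mm, int kk)
{
    double sumn = cof[mm];   /* numerator accumulator */
    double sumd = 0.0;       /* denominator accumulator, without the leading 1 */
    for (int j = mm - 1; j >= 0; j--)
        sumn = sumn * x + cof[j];
    for (int j = mm + kk; j >= mm + 1; j--)
        sumd = (sumd + cof[j]) * x;
    return sumn / (1.0 + sumd);
}
```

For example, with cof = {1, 2, 3}, mm = 1, and kk = 1 the function evaluated is (1 + 2x)/(1 + 3x).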


CITED REFERENCES AND FURTHER READING: 

Ralston, A. and Wilf, H.S. 1960, Mathematical Methods for Digital Computers (New York: Wiley),
Chapter 13. [1]


5.14 Evaluation of Functions by Path Integration


In computer programming, the technique of choice is not necessarily the most 
efficient, or elegant, or fastest executing one. Instead, it may be the one that is quick 
to implement, general, and easy to check. 

One sometimes needs only a few, or a few thousand, evaluations of a special 
function, perhaps a complex valued function of a complex variable, that has many 
different parameters, or asymptotic regimes, or both. Use of the usual tricks (series, 
continued fractions, rational function approximations, recurrence relations, and so 
forth) may result in a patchwork program with tests and branches to different 
formulas. While such a program may be highly efficient in execution, it is often not 
the shortest way to the answer from a standing start. 

A different technique of considerable generality is direct integration of a
function’s defining differential equation (an ab initio integration for each desired
function value), along a path in the complex plane if necessary. While this may at
first seem like swatting a fly with a golden brick, it turns out that when you already
have the brick, and the fly is asleep right under it, all you have to do is let it fall!

As a specific example, let us consider the complex hypergeometric function
${}_2F_1(a, b, c; z)$, which is defined as the analytic continuation of the so-called
hypergeometric series,

$$ {}_2F_1(a, b, c; z) = 1 + \frac{ab}{c} \frac{z}{1!} + \frac{a(a+1)b(b+1)}{c(c+1)} \frac{z^2}{2!} + \cdots + \frac{a(a+1) \cdots (a+j-1)\, b(b+1) \cdots (b+j-1)}{c(c+1) \cdots (c+j-1)} \frac{z^j}{j!} + \cdots \tag{5.14.1} $$

The series converges only within the unit circle |z| < 1 (see [1]), but one’s interest
in the function is often not confined to this region.

The hypergeometric function ${}_2F_1$ is a solution (in fact the solution that is regular
at the origin) of the hypergeometric differential equation, which we can write as

$$ z(1 - z)F'' = abF - [c - (a + b + 1)z]F' \tag{5.14.2} $$

Here prime denotes d/dz. One can see that the equation has regular singular points
at z = 0, 1, and ∞. Since the desired solution is regular at z = 0, the values 1 and
∞ will in general be branch points. If we want ${}_2F_1$ to be a single valued function,
we must have a branch cut connecting these two points. A conventional position for
this cut is along the positive real axis from 1 to ∞, though we may wish to keep
open the possibility of altering this choice for some applications.

Our golden brick consists of a collection of routines for the integration of sets 
of ordinary differential equations, which we will develop in detail later, in Chapter 
16. For now, we need only a high-level, “black-box” routine that integrates such 
a set from initial conditions at one value of a (real) independent variable to final 
conditions at some other value of the independent variable, while automatically 
adjusting its internal stepsize to maintain some specified accuracy. That routine is 
called odeint and, in one particular invocation, calculates its individual steps with 
a sophisticated Bulirsch-Stoer technique. 

Suppose that we know values for F and its derivative F' at some value $z_0$, and
that we want to find F at some other point $z_1$ in the complex plane. The straight-line
path connecting these two points is parametrized by

$$ z(s) = z_0 + s(z_1 - z_0) \tag{5.14.3} $$

with s a real parameter. The differential equation (5.14.2) can now be written as
a set of two first-order equations,

$$ \frac{dF}{ds} = (z_1 - z_0) F' \qquad \frac{dF'}{ds} = (z_1 - z_0) \, \frac{abF - [c - (a + b + 1)z]F'}{z(1 - z)} \tag{5.14.4} $$


to be integrated from s = 0 to s = 1. Here F and F' are to be viewed as two 
independent complex variables. The fact that prime means d/dz can be ignored; it 
will emerge as a consequence of the first equation in (5.14.4). Moreover, the real and 
imaginary parts of equation (5.14.4) define a set of four real differential equations, 
with independent variable s. The complex arithmetic on the right-hand side can be 
viewed as mere shorthand for how the four components are to be coupled. It is 
precisely this point of view that gets passed to the routine odeint, since it knows 
nothing of either complex functions or complex independent variables. 

It remains only to decide where to start, and what path to take in the complex 
plane, to get to an arbitrary point z. This is where consideration of the function’s 
singularities, and the adopted branch cut, enter. Figure 5.14.1 shows the strategy 
that we adopt. For |z| < 1/2, the series in equation (5.14.1) will in general converge
rapidly, and it makes sense to use it directly. Otherwise, we integrate along a straight
line path from one of the starting points (±1/2, 0) or (0, ±1/2). The former choices
are natural for 0 < Re(z) < 1 and Re(z) < 0, respectively. The latter choices are
used for Re(z) > 1, above and below the branch cut; the purpose of starting away
from the real axis in these cases is to avoid passing too close to the singularity at
z = 1 (see Figure 5.14.1). The location of the branch cut is defined by the fact that
our adopted strategy never integrates across the real axis for Re(z) > 1.

An implementation of this algorithm is given in §6.12 as the routine hypgeo.

Figure 5.14.1. Complex plane showing the singular points of the hypergeometric function, its branch
cut, and some integration paths from the circle |z| = 1/2 (where the power series converges rapidly)
to other points in the plane.

A number of variants on the procedure described thus far are possible, and easy
to program. If successively called values of z are close together (with identical values
of a, b, and c), then you can save the state vector (F, F') and the corresponding value
of z on each call, and use these as starting values for the next call. The incremental
integration may then take only one or two steps. Avoid integrating across the branch
cut unintentionally: the function value will be “correct,” but not the one you want.

Alternatively, you may wish to integrate to some position z by a dog-leg path
that does cross the real axis Re z > 1, as a means of moving the branch cut. For
example, in some cases you might want to integrate from (0, 1/2) to (3/2, 1/2),
and go from there to any point with Re z > 1, with either sign of Im z. (If
you are, for example, finding roots of a function by an iterative method, you do
not want the integration for nearby values to take different paths around a branch
point. If it does, your root-finder will see discontinuous function values, and will
likely not converge correctly!)

In any case, be aware that a loss of numerical accuracy can result if you integrate
through a region of large function value on your way to a final answer where the
function value is small. (For the hypergeometric function, a particular case of this is
when a and b are both large and positive, with c and $x \gtrsim 1$.) In such cases, you’ll
need to find a better dog-leg path.

The general technique of evaluating a function by integrating its differential
equation in the complex plane can also be applied to other special functions. For
example, the complex Bessel function, Airy function, Coulomb wave function, and
Weber function are all special cases of the confluent hypergeometric function, with a
differential equation similar to the one used above (see, e.g., [1] §13.6, for a table of
special cases). The confluent hypergeometric function has no singularities at finite z:
That makes it easy to integrate. However, its essential singularity at infinity means
that it can have, along some paths and for some parameters, highly oscillatory or
exponentially decreasing behavior: That makes it hard to integrate. Some case by
case judgment (or experimentation) is therefore required.

CITED REFERENCES AND FURTHER READING: 

Abramowitz, M., and Stegun, I.A. 1964, Handbook of Mathematical Functions, Applied
Mathematics Series, Volume 55 (Washington: National Bureau of Standards; reprinted 1968 by
Dover Publications, New York). [1]


Chapter 6. Special Functions

6.0 Introduction 


There is nothing particularly special about a special function, except that 
some person in authority or textbook writer (not the same thing!) has decided to 
bestow the moniker. Special functions are sometimes called higher transcendental 
functions (higher than what?) or functions of mathematical physics (but they occur in 
other fields also) or functions that satisfy certain frequently occurring second-order 
differential equations (but not all special functions do). One might simply call them 
“useful functions” and let it go at that; it is surely only a matter of taste which 
functions we have chosen to include in this chapter. 

Good commercially available program libraries, such as NAG or IMSL, contain 
routines for a number of special functions. These routines are intended for users who 
will have no idea what goes on inside them. Such state of the art “black boxes” are 
often very messy things, full of branches to completely different methods depending 
on the value of the calling arguments. Black boxes have, or should have, careful 
control of accuracy, to some stated uniform precision in all regimes. 

We will not be quite so fastidious in our examples, in part because we want 
to illustrate techniques from Chapter 5, and in part because we want you to 
understand what goes on in the routines presented. Some of our routines have an 
accuracy parameter that can be made as small as desired, while others (especially 
those involving polynomial fits) give only a certain accuracy, one that we believe 
serviceable (typically six significant figures or more). We do not certify that the 
routines are perfect black boxes. We do hope that, if you ever encounter trouble 
in a routine, you will be able to diagnose and correct the problem on the basis of 
the information that we have given. 

In short, the special function routines of this chapter are meant to be used — 
we use them all the time — but we also want you to be prepared to understand 
their inner workings. 

CITED REFERENCES AND FURTHER READING: 

Abramowitz, M., and Stegun, I.A. 1964, Handbook of Mathematical Functions, Applied Mathematics
Series, Volume 55 (Washington: National Bureau of Standards; reprinted 1968 by
Dover Publications, New York) [full of useful numerical approximations to a great variety
of functions].

IMSL Sfun/Library Users Manual (IMSL Inc., 2500 CityWest Boulevard, Houston TX 77042).

NAG Fortran Library (Numerical Algorithms Group, 256 Banbury Road, Oxford OX2 7DE, U.K.),
Chapter S.

Hart, J.F., et al. 1968, Computer Approximations (New York: Wiley).

Hastings, C. 1955, Approximations for Digital Computers (Princeton: Princeton University Press). 
Luke, Y.L. 1975, Mathematical Functions and Their Approximations (New York: Academic Press). 


6.1 Gamma Function, Beta Function, Factorials, 
Binomial Coefficients 

The gamma function is defined by the integral

$$\Gamma(z) = \int_0^\infty t^{z-1} e^{-t}\,dt \tag{6.1.1}$$

When the argument z is an integer, the gamma function is just the familiar factorial
function, but offset by one,

$$n! = \Gamma(n+1) \tag{6.1.2}$$

The gamma function satisfies the recurrence relation

$$\Gamma(z+1) = z\,\Gamma(z) \tag{6.1.3}$$

If the function is known for arguments z > 1 or, more generally, in the half complex
plane Re(z) > 1, it can be obtained for z < 1 or Re(z) < 1 by the reflection formula

$$\Gamma(1-z) = \frac{\pi}{\Gamma(z)\sin(\pi z)} = \frac{\pi z}{\Gamma(1+z)\sin(\pi z)} \tag{6.1.4}$$

Notice that Γ(z) has a pole at z = 0, and at all negative integer values of z.

There are a variety of methods in use for calculating the function Γ(z)
numerically, but none is quite as neat as the approximation derived by Lanczos [1].
This scheme is entirely specific to the gamma function, seemingly plucked from
thin air. We will not attempt to derive the approximation, but only state the
resulting formula: For certain integer choices of γ and N, and for certain coefficients
c_1, c_2, ..., c_N, the gamma function is given by

$$\Gamma(z+1) = \left(z+\gamma+\tfrac12\right)^{z+\frac12} e^{-\left(z+\gamma+\frac12\right)} \times \sqrt{2\pi}\left[c_0 + \frac{c_1}{z+1} + \frac{c_2}{z+2} + \cdots + \frac{c_N}{z+N} + \epsilon\right] \qquad (z > 0) \tag{6.1.5}$$

You can see that this is a sort of take-off on Stirling’s approximation, but with a
series of corrections that take into account the first few poles in the left complex
plane. The constant c_0 is very nearly equal to 1. The error term is parametrized by ε.
For γ = 5, N = 6, and a certain set of c’s, the error is smaller than |ε| < 2 × 10^{-10}.
Impressed? If not, then perhaps you will be impressed by the fact that (with these
same parameters) the formula (6.1.5) and bound on ε apply for the complex gamma
function, everywhere in the half complex plane Re z > 0.
It is better to implement ln Γ(x) than Γ(x), since the latter will overflow many
computers’ floating-point representation at quite modest values of x. Often the
gamma function is used in calculations where the large values of Γ(x) are divided by
other large numbers, with the result being a perfectly ordinary value. Such operations
would normally be coded as subtraction of logarithms. With (6.1.5) in hand, we can
compute the logarithm of the gamma function with two calls to a logarithm and 25
or so arithmetic operations. This makes it not much more difficult than other built-in
functions that we take for granted, such as sin x or e^x:

#include <math.h>

float gammln(float xx)
/* Returns the value ln[Γ(xx)] for xx > 0. */
{
    /* Internal arithmetic will be done in double precision, a nicety that you can
       omit if five-figure accuracy is good enough. */
    double x,y,tmp,ser;
    static double cof[6]={76.18009172947146,-86.50532032941677,
        24.01409824083091,-1.231739572450155,
        0.1208650973866179e-2,-0.5395239384953e-5};
    int j;

    y=x=xx;
    tmp=x+5.5;
    tmp -= (x+0.5)*log(tmp);
    ser=1.000000000190015;
    for (j=0;j<=5;j++) ser += cof[j]/++y;
    return -tmp+log(2.5066282746310005*ser/x);
}

How shall we write a routine for the factorial function n!? Generally the 
factorial function will be called for small integer values (for large values it will 
overflow anyway!), and in most applications the same integer value will be called for 
many times. It is a profligate waste of computer time to call exp(gammln(n+1.0))
for each required factorial. Better to go back to basics, holding gammln in reserve 
for unlikely calls: 

#include <math.h>

float factrl(int n)
/* Returns the value n! as a floating-point number. */
{
    float gammln(float xx);
    void nrerror(char error_text[]);
    static int ntop=4;
    static float a[33]={1.0,1.0,2.0,6.0,24.0};    /* Fill in table only as required. */
    int j;

    if (n < 0) nrerror("Negative factorial in routine factrl");
    if (n > 32) return exp(gammln(n+1.0));
    /* Larger value than size of table is required. Actually, this big a value is going
       to overflow on many computers, but no harm in trying. */
    while (ntop<n) {    /* Fill in table up to desired value. */
        j=ntop++;
        a[ntop]=a[j]*ntop;
    }
    return a[n];
}
A useful point is that factrl will be exact for the smaller values of n, since
floating-point multiplies on small integers are exact on all computers. This exactness
will not hold if we turn to the logarithm of the factorials. For binomial coefficients,
however, we must do exactly this, since the individual factorials in a binomial
coefficient will overflow long before the coefficient itself will.

The binomial coefficient is defined by

$$\binom{n}{k} = \frac{n!}{k!\,(n-k)!} \qquad 0 \le k \le n \tag{6.1.6}$$


#include <math.h>

float bico(int n, int k)
/* Returns the binomial coefficient (n choose k) as a floating-point number. */
{
    float factln(int n);

    /* The floor function cleans up roundoff error for smaller values of n and k. */
    return floor(0.5+exp(factln(n)-factln(k)-factln(n-k)));
}

which uses

float factln(int n)
/* Returns ln(n!). */
{
    float gammln(float xx);
    void nrerror(char error_text[]);
    static float a[101];    /* A static array is automatically initialized to zero. */

    if (n < 0) nrerror("Negative factorial in routine factln");
    if (n <= 1) return 0.0;
    if (n <= 100) return a[n] ? a[n] : (a[n]=gammln(n+1.0));    /* In range of table. */
    else return gammln(n+1.0);    /* Out of range of table. */
}


If your problem requires a series of related binomial coefficients, a good idea
is to use recurrence relations, for example

$$\binom{n}{k+1} = \frac{n-k}{k+1} \binom{n}{k} \tag{6.1.7}$$
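A short sketch of the recurrence in use: the hypothetical routine binomial_row (not one of this book's routines) fills one whole row of binomial coefficients, starting from the coefficient for k = 0 and stepping k upward.

```c
/* Hypothetical helper: fill row[0..n] with the binomial coefficients for
   k = 0..n, using C(n,k+1) = C(n,k) * (n-k)/(k+1).
   The caller must supply an array with at least n+1 elements. */
void binomial_row(int n, double row[])
{
    int k;

    row[0] = 1.0;                         /* C(n,0) = 1 */
    for (k = 0; k < n; k++)
        /* Multiply first, then divide: the product is an exact multiple of k+1. */
        row[k + 1] = row[k] * (double)(n - k) / (double)(k + 1);
}
```

Because the multiplication is done before the division, every intermediate value is an integer, so the results stay exact for as long as those integers fit in the floating-point mantissa. This is an observation about the sketch, not a claim about bico.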



Finally, turning away from the combinatorial functions with integer valued
arguments, we come to the beta function,

$$B(z,w) = B(w,z) = \int_0^1 t^{z-1} (1-t)^{w-1}\,dt \tag{6.1.8}$$

which is related to the gamma function by

$$B(z,w) = \frac{\Gamma(z)\,\Gamma(w)}{\Gamma(z+w)} \tag{6.1.9}$$

hence


#include <math.h>

float beta(float z, float w)
/* Returns the value of the beta function B(z,w). */
{
    float gammln(float xx);

    return exp(gammln(z)+gammln(w)-gammln(z+w));
}


CITED REFERENCES AND FURTHER READING: 

Abramowitz, M., and Stegun, I.A. 1964, Handbook of Mathematical Functions, Applied Mathematics
Series, Volume 55 (Washington: National Bureau of Standards; reprinted 1968 by
Dover Publications, New York), Chapter 6.

Lanczos, C. 1964, SIAM Journal on Numerical Analysis, ser. B, vol. 1, pp. 86-96. [1]


6.2 Incomplete Gamma Function, Error Function,
Chi-Square Probability Function,
Cumulative Poisson Function

The incomplete gamma function is defined by

$$P(a,x) \equiv \frac{\gamma(a,x)}{\Gamma(a)} \equiv \frac{1}{\Gamma(a)} \int_0^x e^{-t} t^{a-1}\,dt \qquad (a > 0) \tag{6.2.1}$$

It has the limiting values

$$P(a,0) = 0 \quad\text{and}\quad P(a,\infty) = 1 \tag{6.2.2}$$

The incomplete gamma function P(a, x) is monotonic and (for a greater than one or
so) rises from “near-zero” to “near-unity” in a range of x centered on about a − 1,
and of width about √a (see Figure 6.2.1).

The complement of P(a, x) is also confusingly called an incomplete gamma
function,

$$Q(a,x) \equiv 1 - P(a,x) \equiv \frac{\Gamma(a,x)}{\Gamma(a)} \equiv \frac{1}{\Gamma(a)} \int_x^\infty e^{-t} t^{a-1}\,dt \qquad (a > 0) \tag{6.2.3}$$

It has the limiting values

$$Q(a,0) = 1 \quad\text{and}\quad Q(a,\infty) = 0 \tag{6.2.4}$$

The notations P(a, x), γ(a, x), and Γ(a, x) are standard; the notation Q(a, x) is
specific to this book.

There is a series development for γ(a, x) as follows:

$$\gamma(a,x) = e^{-x} x^a \sum_{n=0}^{\infty} \frac{\Gamma(a)}{\Gamma(a+1+n)}\, x^n \tag{6.2.5}$$

One does not actually need to compute a new Γ(a + 1 + n) for each n; one rather
uses equation (6.1.3) and the previous coefficient.

A continued fraction development for Γ(a, x) is

$$\Gamma(a,x) = e^{-x} x^a \left( \frac{1}{x+}\;\frac{1-a}{1+}\;\frac{1}{x+}\;\frac{2-a}{1+}\;\frac{2}{x+}\cdots \right) \qquad (x > 0) \tag{6.2.6}$$

It is computationally better to use the even part of (6.2.6), which converges twice
as fast (see §5.2):

$$\Gamma(a,x) = e^{-x} x^a \left( \frac{1}{x+1-a-}\;\frac{1\cdot(1-a)}{x+3-a-}\;\frac{2\cdot(2-a)}{x+5-a-}\cdots \right) \qquad (x > 0) \tag{6.2.7}$$



It turns out that (6.2.5) converges rapidly for x less than about a + 1, while
(6.2.6) or (6.2.7) converges rapidly for x greater than about a + 1. In these respective
regimes each requires at most a few times √a terms to converge, and this many
only near x = a, where the incomplete gamma functions are varying most rapidly.
Thus (6.2.5) and (6.2.7) together allow evaluation of the function for all positive
a and x. An extra dividend is that we never need compute a function value near
zero by subtracting two nearly equal numbers. The higher-level functions that return
P(a, x) and Q(a, x) are

float gammp(float a, float x)
/* Returns the incomplete gamma function P(a,x). */
{
    void gcf(float *gammcf, float a, float x, float *gln);
    void gser(float *gamser, float a, float x, float *gln);
    void nrerror(char error_text[]);
    float gamser,gammcf,gln;

    if (x < 0.0 || a <= 0.0) nrerror("Invalid arguments in routine gammp");
    if (x < (a+1.0)) {    /* Use the series representation. */
        gser(&gamser,a,x,&gln);
        return gamser;
    } else {    /* Use the continued fraction representation */
        gcf(&gammcf,a,x,&gln);
        return 1.0-gammcf;    /* and take its complement. */
    }
}


float gammq(float a, float x)
/* Returns the incomplete gamma function Q(a,x) ≡ 1 − P(a,x). */
{
    void gcf(float *gammcf, float a, float x, float *gln);
    void gser(float *gamser, float a, float x, float *gln);
    void nrerror(char error_text[]);
    float gamser,gammcf,gln;

    if (x < 0.0 || a <= 0.0) nrerror("Invalid arguments in routine gammq");
    if (x < (a+1.0)) {    /* Use the series representation */
        gser(&gamser,a,x,&gln);
        return 1.0-gamser;    /* and take its complement. */
    } else {    /* Use the continued fraction representation. */
        gcf(&gammcf,a,x,&gln);
        return gammcf;
    }
}


The argument gln is set by both the series and continued fraction procedures
to the value ln Γ(a); the reason for this is so that it is available to you if you want to
modify the above two procedures to give γ(a, x) and Γ(a, x), in addition to P(a, x)
and Q(a, x) (cf. equations 6.2.1 and 6.2.3).

The functions gser and gcf which implement (6.2.5) and (6.2.7) are

#include <math.h>
#define ITMAX 100
#define EPS 3.0e-7

void gser(float *gamser, float a, float x, float *gln)
/* Returns the incomplete gamma function P(a,x) evaluated by its series representation
   as gamser. Also returns ln Γ(a) as gln. */
{
    float gammln(float xx);
    void nrerror(char error_text[]);
    int n;
    float sum,del,ap;

    *gln=gammln(a);
    if (x <= 0.0) {
        if (x < 0.0) nrerror("x less than 0 in routine gser");
        *gamser=0.0;
        return;
    } else {
        ap=a;
        del=sum=1.0/a;
        for (n=1;n<=ITMAX;n++) {
            ++ap;
            del *= x/ap;
            sum += del;
            if (fabs(del) < fabs(sum)*EPS) {
                *gamser=sum*exp(-x+a*log(x)-(*gln));
                return;
            }
        }
        nrerror("a too large, ITMAX too small in routine gser");
        return;
    }
}


#include <math.h>
#define ITMAX 100        /* Maximum allowed number of iterations. */
#define EPS 3.0e-7       /* Relative accuracy. */
#define FPMIN 1.0e-30    /* Number near the smallest representable floating-point number. */

void gcf(float *gammcf, float a, float x, float *gln)
/* Returns the incomplete gamma function Q(a,x) evaluated by its continued fraction
   representation as gammcf. Also returns ln Γ(a) as gln. */
{
    float gammln(float xx);
    void nrerror(char error_text[]);
    int i;
    float an,b,c,d,del,h;

    *gln=gammln(a);
    b=x+1.0-a;    /* Set up for evaluating continued fraction by modified
                     Lentz's method (§5.2) with b_0 = 0. */
    c=1.0/FPMIN;
    d=1.0/b;
    h=d;
    for (i=1;i<=ITMAX;i++) {    /* Iterate to convergence. */
        an = -i*(i-a);
        b += 2.0;
        d=an*d+b;
        if (fabs(d) < FPMIN) d=FPMIN;
        c=b+an/c;
        if (fabs(c) < FPMIN) c=FPMIN;
        d=1.0/d;
        del=d*c;
        h *= del;
        if (fabs(del-1.0) < EPS) break;
    }
    if (i > ITMAX) nrerror("a too large, ITMAX too small in gcf");
    *gammcf=exp(-x+a*log(x)-(*gln))*h;    /* Put factors in front. */
}
Error Function

The error function and complementary error function are special cases of the
incomplete gamma function, and are obtained moderately efficiently by the above
procedures. Their definitions are

$$\operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_0^x e^{-t^2}\,dt \tag{6.2.8}$$

and

$$\operatorname{erfc}(x) \equiv 1 - \operatorname{erf}(x) = \frac{2}{\sqrt{\pi}} \int_x^\infty e^{-t^2}\,dt \tag{6.2.9}$$

The functions have the following limiting values and symmetries:

$$\operatorname{erf}(0) = 0 \qquad \operatorname{erf}(\infty) = 1 \qquad \operatorname{erf}(-x) = -\operatorname{erf}(x) \tag{6.2.10}$$

$$\operatorname{erfc}(0) = 1 \qquad \operatorname{erfc}(\infty) = 0 \qquad \operatorname{erfc}(-x) = 2 - \operatorname{erfc}(x) \tag{6.2.11}$$

They are related to the incomplete gamma functions by

$$\operatorname{erf}(x) = P\left(\tfrac12, x^2\right) \qquad (x \ge 0) \tag{6.2.12}$$

and

$$\operatorname{erfc}(x) = Q\left(\tfrac12, x^2\right) \qquad (x \ge 0) \tag{6.2.13}$$


We’ll put an extra “f” into our routine names to avoid conflicts with names already 
in some C libraries: 


float erff(float x)
/* Returns the error function erf(x). */
{
    float gammp(float a, float x);

    return x < 0.0 ? -gammp(0.5,x*x) : gammp(0.5,x*x);
}


float erffc(float x)
/* Returns the complementary error function erfc(x). */
{
    float gammp(float a, float x);
    float gammq(float a, float x);

    return x < 0.0 ? 1.0+gammp(0.5,x*x) : gammq(0.5,x*x);
}



If you care to do so, you can easily remedy the minor inefficiency in erff and
erffc, namely that Γ(0.5) = √π is computed unnecessarily when gammp or gammq
is called. Before you do that, however, you might wish to consider the following
routine, based on Chebyshev fitting to an inspired guess as to the functional form:
#include <math.h>

float erfcc(float x)
/* Returns the complementary error function erfc(x) with fractional error everywhere
   less than 1.2 × 10^{-7}. */
{
    float t,z,ans;

    z=fabs(x);
    t=1.0/(1.0+0.5*z);
    ans=t*exp(-z*z-1.26551223+t*(1.00002368+t*(0.37409196+t*(0.09678418+
        t*(-0.18628806+t*(0.27886807+t*(-1.13520398+t*(1.48851587+
        t*(-0.82215223+t*0.17087277)))))))));
    return x >= 0.0 ? ans : 2.0-ans;
}


There are also some functions of two variables that are special cases of the 
incomplete gamma function: 

Cumulative Poisson Probability Function 

P_x(< k), for positive x and integer k ≥ 1, denotes the cumulative Poisson
probability function. It is defined as the probability that the number of Poisson
random events occurring will be between 0 and k − 1 inclusive, if the expected mean
number is x. It has the limiting values

$$P_x(<1) = e^{-x} \qquad P_x(<\infty) = 1 \tag{6.2.14}$$

Its relation to the incomplete gamma function is simply

$$P_x(<k) = Q(k,x) = \text{gammq}(k,x) \tag{6.2.15}$$


Chi-Square Probability Function 

P(χ²|ν) is defined as the probability that the observed chi-square for a correct
model should be less than a value χ². (We will discuss the use of this function in
Chapter 15.) Its complement Q(χ²|ν) is the probability that the observed chi-square
will exceed the value χ² by chance even for a correct model. In both cases ν is an
integer, the number of degrees of freedom. The functions have the limiting values

$$P(0|\nu) = 0 \qquad P(\infty|\nu) = 1 \tag{6.2.16}$$

$$Q(0|\nu) = 1 \qquad Q(\infty|\nu) = 0 \tag{6.2.17}$$

and the following relation to the incomplete gamma functions,

$$P(\chi^2|\nu) = P\left(\frac{\nu}{2}, \frac{\chi^2}{2}\right) = \text{gammp}\left(\frac{\nu}{2}, \frac{\chi^2}{2}\right) \tag{6.2.18}$$

$$Q(\chi^2|\nu) = Q\left(\frac{\nu}{2}, \frac{\chi^2}{2}\right) = \text{gammq}\left(\frac{\nu}{2}, \frac{\chi^2}{2}\right) \tag{6.2.19}$$
CITED REFERENCES AND FURTHER READING:

Abramowitz, M., and Stegun, I.A. 1964, Handbook of Mathematical Functions, Applied Mathematics
Series, Volume 55 (Washington: National Bureau of Standards; reprinted 1968 by
Dover Publications, New York), Chapters 6, 7, and 26.

Pearson, K. (ed.) 1951, Tables of the Incomplete Gamma Function (Cambridge: Cambridge 
University Press). 


6.3 Exponential Integrals 


The standard definition of the exponential integral is

$$E_n(x) = \int_1^\infty \frac{e^{-xt}}{t^n}\,dt, \qquad x > 0, \quad n = 0, 1, \ldots \tag{6.3.1}$$

The function defined by the principal value of the integral

$$\mathrm{Ei}(x) = -\int_{-x}^{\infty} \frac{e^{-t}}{t}\,dt = \int_{-\infty}^{x} \frac{e^{t}}{t}\,dt, \qquad x > 0 \tag{6.3.2}$$

is also called an exponential integral. Note that Ei(−x) is related to −E_1(x) by
analytic continuation.

The function E_n(x) is a special case of the incomplete gamma function

$$E_n(x) = x^{n-1}\,\Gamma(1-n, x) \tag{6.3.3}$$

We can therefore use a similar strategy for evaluating it. The continued fraction,
which is just equation (6.2.6) rewritten, converges for all x > 0:


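$$E_n(x) = e^{-x} \left( \frac{1}{x+}\;\frac{n}{1+}\;\frac{1}{x+}\;\frac{n+1}{1+}\;\frac{2}{x+}\cdots \right) \tag{6.3.4}$$

We use it in its more rapidly converging even form,

$$E_n(x) = e^{-x} \left( \frac{1}{x+n-}\;\frac{1\cdot n}{x+n+2-}\;\frac{2(n+1)}{x+n+4-}\cdots \right) \tag{6.3.5}$$

The continued fraction only really converges fast enough to be useful for x ≳ 1.
For 0 < x ≲ 1, we can use the series representation

$$E_n(x) = \frac{(-x)^{n-1}}{(n-1)!}\left[-\ln x + \psi(n)\right] - \sum_{\substack{m=0\\ m\ne n-1}}^{\infty} \frac{(-x)^m}{(m-n+1)\,m!} \tag{6.3.6}$$

The quantity ψ(n) here is the digamma function, given for integer arguments by

$$\psi(1) = -\gamma, \qquad \psi(n) = -\gamma + \sum_{m=1}^{n-1} \frac{1}{m} \tag{6.3.7}$$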
where γ = 0.5772156649... is Euler’s constant. We evaluate the expression (6.3.6)
in order of ascending powers of x:

$$E_n(x) = -\left[ \frac{1}{(1-n)} - \frac{x}{(2-n)\cdot 1} + \frac{x^2}{(3-n)(1\cdot 2)} - \cdots + \frac{(-x)^{n-2}}{(-1)(n-2)!} \right] + \frac{(-x)^{n-1}}{(n-1)!}\left[-\ln x + \psi(n)\right] - \left[ \frac{(-x)^n}{1\cdot n!} + \frac{(-x)^{n+1}}{2\cdot(n+1)!} + \cdots \right] \tag{6.3.8}$$

The first square bracket is omitted when n = 1. This method of evaluation has the
advantage that for large n the series converges before reaching the term containing
ψ(n). Accordingly, one needs an algorithm for evaluating ψ(n) only for small n,
n ≲ 20 to 40. We use equation (6.3.7), although a table look-up would improve
efficiency slightly.

Amos [1] presents a careful discussion of the truncation error in evaluating
equation (6.3.8), and gives a fairly elaborate termination criterion. We have found
that simply stopping when the last term added is smaller than the required tolerance
works about as well.

Two special cases have to be handled separately:

$$E_0(x) = \frac{e^{-x}}{x}, \qquad E_n(0) = \frac{1}{n-1}, \quad n > 1 \tag{6.3.9}$$


The routine expint allows fast evaluation of E_n(x) to any accuracy EPS
within the reach of your machine’s word length for floating-point numbers. The 
only modification required for increased accuracy is to supply Euler’s constant with 
enough significant digits. Wrench [2] can provide you with the first 328 digits if 
necessary! 


#include <math.h>
#define MAXIT 100             /* Maximum allowed number of iterations. */
#define EULER 0.5772156649    /* Euler's constant γ. */
#define FPMIN 1.0e-30         /* Close to smallest representable floating-point number. */
#define EPS 1.0e-7            /* Desired relative error, not smaller than the machine precision. */

float expint(int n, float x)
/* Evaluates the exponential integral E_n(x). */
{
    void nrerror(char error_text[]);
    int i,ii,nm1;
    float a,b,c,d,del,fact,h,psi,ans;

    nm1=n-1;
    if (n < 0 || x < 0.0 || (x==0.0 && (n==0 || n==1)))
        nrerror("bad arguments in expint");
    else {
        if (n == 0) ans=exp(-x)/x;    /* Special case. */
        else {
            if (x == 0.0) ans=1.0/nm1;    /* Another special case. */
            else {
                if (x > 1.0) {    /* Lentz's algorithm (§5.2). */
                    b=x+n;
                    c=1.0/FPMIN;
                    d=1.0/b;
                    h=d;
                    for (i=1;i<=MAXIT;i++) {
                        a = -i*(nm1+i);
                        b += 2.0;
                        d=1.0/(a*d+b);    /* Denominators cannot be zero. */
                        c=b+a/c;
                        del=c*d;
                        h *= del;
                        if (fabs(del-1.0) < EPS) {
                            ans=h*exp(-x);
                            return ans;
                        }
                    }
                    nrerror("continued fraction failed in expint");
                } else {    /* Evaluate series. */
                    ans = (nm1!=0 ? 1.0/nm1 : -log(x)-EULER);    /* Set first term. */
                    fact=1.0;
                    for (i=1;i<=MAXIT;i++) {
                        fact *= -x/i;
                        if (i != nm1) del = -fact/(i-nm1);
                        else {
                            psi = -EULER;    /* Compute ψ(n). */
                            for (ii=1;ii<=nm1;ii++) psi += 1.0/ii;
                            del=fact*(-log(x)+psi);
                        }
                        ans += del;
                        if (fabs(del) < fabs(ans)*EPS) return ans;
                    }
                    nrerror("series failed in expint");
                }
            }
        }
    }
    return ans;
}

A good algorithm for evaluating Ei is to use the power series for small x and
the asymptotic series for large x. The power series is

$$\mathrm{Ei}(x) = \gamma + \ln x + \frac{x}{1\cdot 1!} + \frac{x^2}{2\cdot 2!} + \cdots \tag{6.3.10}$$

where γ is Euler’s constant. The asymptotic expansion is

$$\mathrm{Ei}(x) \sim \frac{e^x}{x}\left(1 + \frac{1!}{x} + \frac{2!}{x^2} + \cdots\right) \tag{6.3.11}$$

The lower limit for the use of the asymptotic expansion is approximately |ln EPS|,
where EPS is the required relative error.
#include <math.h>
#define EULER 0.57721566    /* Euler's constant γ. */
#define MAXIT 100           /* Maximum number of iterations allowed. */
#define FPMIN 1.0e-30       /* Close to smallest representable floating-point number. */
#define EPS 6.0e-8          /* Relative error, or absolute error near the zero of Ei. */

float ei(float x)
/* Computes the exponential integral Ei(x) for x > 0. */
{
    void nrerror(char error_text[]);
    int k;
    float fact,prev,sum,term;

    if (x <= 0.0) nrerror("Bad argument in ei");
    /* Special case: avoid failure of convergence test because of underflow. */
    if (x < FPMIN) return log(x)+EULER;
    if (x <= -log(EPS)) {
        sum=0.0;    /* Use power series. */
        fact=1.0;
        for (k=1;k<=MAXIT;k++) {
            fact *= x/k;
            term=fact/k;
            sum += term;
            if (term < EPS*sum) break;
        }
        if (k > MAXIT) nrerror("Series failed in ei");
        return sum+log(x)+EULER;
    } else {    /* Use asymptotic series. */
        sum=0.0;    /* Start with second term. */
        term=1.0;
        for (k=1;k<=MAXIT;k++) {
            prev=term;
            term *= k/x;
            if (term < EPS) break;
            /* Since final sum is greater than one, term itself approximates the
               relative error. */
            if (term < prev) sum += term;    /* Still converging: add new term. */
            else {
                sum -= prev;    /* Diverging: subtract previous term and exit. */
                break;
            }
        }
        return exp(x)*(1.0+sum)/x;
    }
}


CITED REFERENCES AND FURTHER READING: 

Stegun, I.A., and Zucker, R. 1974, Journal of Research of the National Bureau of Standards,
vol. 78B, pp. 199-216; 1976, op. cit., vol. 80B, pp. 291-311.

Amos, D.E. 1980, ACM Transactions on Mathematical Software, vol. 6, pp. 365-377 [1]; also
vol. 6, pp. 420-428.

Abramowitz, M., and Stegun, I.A. 1964, Handbook of Mathematical Functions, Applied Mathematics
Series, Volume 55 (Washington: National Bureau of Standards; reprinted 1968 by
Dover Publications, New York), Chapter 5.

Wrench, J.W. 1952, Mathematical Tables and Other Aids to Computation, vol. 6, p. 255. [2]
Figure 6.4.1. The incomplete beta function I_x(a, b) for five different pairs of (a, b). Notice that the
pairs (0.5, 5.0) and (5.0, 0.5) are symmetrically related as indicated in equation (6.4.3).

6.4 Incomplete Beta Function, Student’s 
Distribution, F-Distribution, Cumulative 
Binomial Distribution 

The incomplete beta function is defined by

$$I_x(a,b) \equiv \frac{B_x(a,b)}{B(a,b)} \equiv \frac{1}{B(a,b)} \int_0^x t^{a-1} (1-t)^{b-1}\,dt \qquad (a, b > 0) \tag{6.4.1}$$

It has the limiting values

$$I_0(a,b) = 0 \qquad I_1(a,b) = 1 \tag{6.4.2}$$

and the symmetry relation

$$I_x(a,b) = 1 - I_{1-x}(b,a) \tag{6.4.3}$$

If a and b are both rather greater than one, then I_x(a, b) rises from “near-zero” to
“near-unity” quite sharply at about x = a/(a + b). Figure 6.4.1 plots the function
for several pairs (a, b).
The incomplete beta function has a series expansion

$$I_x(a,b) = \frac{x^a (1-x)^b}{a\,B(a,b)} \left[ 1 + \sum_{n=0}^{\infty} \frac{B(a+1, n+1)}{B(a+b, n+1)}\, x^{n+1} \right] \tag{6.4.4}$$

but this does not prove to be very useful in its numerical evaluation. (Note, however,
that the beta functions in the coefficients can be evaluated for each value of n with
just the previous value and a few multiplies, using equations 6.1.9 and 6.1.3.)

The continued fraction representation proves to be much more useful,

$$I_x(a,b) = \frac{x^a (1-x)^b}{a\,B(a,b)} \left[ \frac{1}{1+}\;\frac{d_1}{1+}\;\frac{d_2}{1+}\cdots \right] \tag{6.4.5}$$

where

$$d_{2m+1} = -\frac{(a+m)(a+b+m)\,x}{(a+2m)(a+2m+1)}, \qquad d_{2m} = \frac{m(b-m)\,x}{(a+2m-1)(a+2m)} \tag{6.4.6}$$

This continued fraction converges rapidly for x < (a + 1)/(a + b + 2), taking in
the worst case O(√max(a, b)) iterations. But for x > (a + 1)/(a + b + 2) we can
just use the symmetry relation (6.4.3) to obtain an equivalent computation where the
continued fraction will also converge rapidly. Hence we have


#include <math.h>

float betai(float a, float b, float x)
/* Returns the incomplete beta function I_x(a,b). */
{
    float betacf(float a, float b, float x);
    float gammln(float xx);
    void nrerror(char error_text[]);
    float bt;

    if (x < 0.0 || x > 1.0) nrerror("Bad x in routine betai");
    if (x == 0.0 || x == 1.0) bt=0.0;
    else    /* Factors in front of the continued fraction. */
        bt=exp(gammln(a+b)-gammln(a)-gammln(b)+a*log(x)+b*log(1.0-x));
    if (x < (a+1.0)/(a+b+2.0))    /* Use continued fraction directly. */
        return bt*betacf(a,b,x)/a;
    else    /* Use continued fraction after making the symmetry transformation. */
        return 1.0-bt*betacf(b,a,1.0-x)/b;
}


which utilizes the continued fraction evaluation routine 

#include <math.h>
#define MAXIT 100
#define EPS 3.0e-7
#define FPMIN 1.0e-30

float betacf(float a, float b, float x)
/* Used by betai: Evaluates continued fraction for incomplete beta function by modified
   Lentz's method (§5.2). */
{
    void nrerror(char error_text[]);
    int m,m2;
    float aa,c,d,del,h,qab,qam,qap;

    qab=a+b;    /* These q's will be used in factors that occur */
    qap=a+1.0;  /* in the coefficients (6.4.6). */
    qam=a-1.0;
    c=1.0;      /* First step of Lentz's method. */
    d=1.0-qab*x/qap;
    if (fabs(d) < FPMIN) d=FPMIN;
    d=1.0/d;
    h=d;
    for (m=1;m<=MAXIT;m++) {
        m2=2*m;
        aa=m*(b-m)*x/((qam+m2)*(a+m2));
        d=1.0+aa*d;    /* One step (the even one) of the recurrence. */
        if (fabs(d) < FPMIN) d=FPMIN;
        c=1.0+aa/c;
        if (fabs(c) < FPMIN) c=FPMIN;
        d=1.0/d;
        h *= d*c;
        aa = -(a+m)*(qab+m)*x/((a+m2)*(qap+m2));
        d=1.0+aa*d;    /* Next step of the recurrence (the odd one). */
        if (fabs(d) < FPMIN) d=FPMIN;
        c=1.0+aa/c;
        if (fabs(c) < FPMIN) c=FPMIN;
        d=1.0/d;
        del=d*c;
        h *= del;
        if (fabs(del-1.0) < EPS) break;    /* Are we done? */
    }
    if (m > MAXIT) nrerror("a or b too big, or MAXIT too small in betacf");
    return h;
}


Student’s Distribution Probability Function 

Student’s distribution, denoted A(t|ν), is useful in several statistical contexts,
notably in the test of whether two observed distributions have the same mean. A(t|ν)
is the probability, for ν degrees of freedom, that a certain statistic t (measuring
the observed difference of means) would be smaller than the observed value if the
means were in fact the same. (See Chapter 14 for further details.) Two means are
significantly different if, e.g., A(t|ν) > 0.99. In other words, 1 − A(t|ν) is the
significance level at which the hypothesis that the means are equal is disproved.

The mathematical definition of the function is

$$A(t|\nu) = \frac{1}{\nu^{1/2}\, B\left(\frac12, \frac{\nu}{2}\right)} \int_{-t}^{t} \left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu+1}{2}} dx \tag{6.4.7}$$

Limiting values are

$$A(0|\nu) = 0 \qquad A(\infty|\nu) = 1 \tag{6.4.8}$$

A(t|ν) is related to the incomplete beta function I_x(a, b) by

$$A(t|\nu) = 1 - I_{\frac{\nu}{\nu+t^2}}\left(\frac{\nu}{2}, \frac12\right) \tag{6.4.9}$$



Sample page from NUMERICAL RECIPES IN C: THE ART OF SCIENTIFIC COMPUTING (ISBN 0-521 -43108-5 





So, you can use (6.4.9) and the above routine betai to evaluate the function. 


F-Distribution Probability Function 


This function occurs in the statistical test of whether two observed samples
have the same variance. A certain statistic F, essentially the ratio of the observed
dispersion of the first sample to that of the second one, is calculated. (For further
details, see Chapter 14.) The probability that F would be as large as it is if the first
sample’s underlying distribution actually has smaller variance than the second’s is
denoted Q(F|ν_1, ν_2), where ν_1 and ν_2 are the number of degrees of freedom in the
first and second samples, respectively. In other words, Q(F|ν_1, ν_2) is the significance
level at which the hypothesis “1 has smaller variance than 2” can be rejected. A
small numerical value implies a very significant rejection, in turn implying high
confidence in the hypothesis “1 has variance greater or equal to 2.”

Q(F|ν_1, ν_2) has the limiting values

$$Q(0|\nu_1, \nu_2) = 1 \qquad Q(\infty|\nu_1, \nu_2) = 0 \tag{6.4.10}$$

Its relation to the incomplete beta function I_x(a, b) as evaluated by betai above is

$$Q(F|\nu_1, \nu_2) = I_{\frac{\nu_2}{\nu_2 + \nu_1 F}}\left(\frac{\nu_2}{2}, \frac{\nu_1}{2}\right) \tag{6.4.11}$$


Cumulative Binomial Probability Distribution 

Suppose an event occurs with probability p per trial. Then the probability P of
its occurring k or more times in n trials is termed a cumulative binomial probability,
and is related to the incomplete beta function I_x(a, b) as follows:

$$P \equiv \sum_{j=k}^{n} \binom{n}{j} p^j (1-p)^{n-j} = I_p(k, n-k+1) \tag{6.4.12}$$

For n larger than a dozen or so, betai is a much better way to evaluate the sum in 
(6.4.12) than would be the straightforward sum with concurrent computation of the 
binomial coefficients. (For n smaller than a dozen, either method is acceptable.) 


CITED REFERENCES AND FURTHER READING: 

Abramowitz, M., and Stegun, I.A. 1964, Handbook of Mathematical Functions, Applied Mathematics
Series, Volume 55 (Washington: National Bureau of Standards; reprinted 1968 by
Dover Publications, New York), Chapters 6 and 26.

Pearson, E., and Johnson, N. 1968, Tables of the Incomplete Beta Function (Cambridge: Cambridge
University Press).
6.5 Bessel Functions of Integer Order


This section and the next one present practical algorithms for computing various 
kinds of Bessel functions of integer order. In §6.7 we deal with fractional order. In 
fact, the more complicated routines for fractional order work fine for integer order 
too. For integer order, however, the routines in this section (and §6.6) are simpler 
and faster. Their only drawback is that they are limited by the precision of the 
underlying rational approximations. For full double precision, it is best to work with 
the routines for fractional order in §6.7. 

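For any real ν, the Bessel function J_ν(x) can be defined by the series
representation

$$J_\nu(x) = \left(\tfrac12 x\right)^{\nu} \sum_{k=0}^{\infty} \frac{\left(-\tfrac14 x^2\right)^k}{k!\,\Gamma(\nu+k+1)} \tag{6.5.1}$$

The series converges for all x, but it is not computationally very useful for x ≫ 1.
For ν not an integer the Bessel function Y_ν(x) is given by

$$Y_\nu(x) = \frac{J_\nu(x)\cos(\nu\pi) - J_{-\nu}(x)}{\sin(\nu\pi)} \tag{6.5.2}$$

The right-hand side goes to the correct limiting value Y_n(x) as ν goes to some integer
n, but this is also not computationally useful.

For arguments x < ν, both Bessel functions look qualitatively like simple
power laws, with the asymptotic forms for 0 < x ≪ ν

$$\begin{aligned}
J_\nu(x) &\sim \frac{1}{\Gamma(\nu+1)}\left(\tfrac12 x\right)^{\nu} & \nu &\ge 0 \\
Y_0(x) &\sim \frac{2}{\pi}\ln(x) \\
Y_\nu(x) &\sim -\frac{\Gamma(\nu)}{\pi}\left(\tfrac12 x\right)^{-\nu} & \nu &> 0
\end{aligned} \tag{6.5.3}$$

For x > ν, both Bessel functions look qualitatively like sine or cosine waves whose
amplitude decays as x^{-1/2}. The asymptotic forms for x ≫ ν are

$$\begin{aligned}
J_\nu(x) &\sim \sqrt{\frac{2}{\pi x}}\,\cos\left(x - \tfrac12\nu\pi - \tfrac14\pi\right) \\
Y_\nu(x) &\sim \sqrt{\frac{2}{\pi x}}\,\sin\left(x - \tfrac12\nu\pi - \tfrac14\pi\right)
\end{aligned} \tag{6.5.4}$$

In the transition region where x ∼ ν, the typical amplitudes of the Bessel functions
are on the order

$$\begin{aligned}
J_\nu(\nu) &\sim \frac{2^{1/3}}{3^{2/3}\,\Gamma\left(\frac23\right)}\,\frac{1}{\nu^{1/3}} \sim \frac{0.4473}{\nu^{1/3}} \\
Y_\nu(\nu) &\sim -\frac{2^{1/3}}{3^{1/6}\,\Gamma\left(\frac23\right)}\,\frac{1}{\nu^{1/3}} \sim -\frac{0.7748}{\nu^{1/3}}
\end{aligned} \tag{6.5.5}$$

which holds asymptotically for large ν. Figure 6.5.1 plots the first few Bessel
functions of each kind.

Figure 6.5.1. Bessel functions J_0(x) through J_3(x) and Y_0(x) through Y_2(x).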

The Bessel functions satisfy the recurrence relations

$$J_{n+1}(x) = \frac{2n}{x} J_n(x) - J_{n-1}(x) \tag{6.5.6}$$

and

$$Y_{n+1}(x) = \frac{2n}{x} Y_n(x) - Y_{n-1}(x) \tag{6.5.7}$$


As already mentioned in §5.5, only the second of these (6.5.7) is stable in the
direction of increasing n for x < n. The reason that (6.5.6) is unstable in the
direction of increasing n is simply that it is the same recurrence as (6.5.7): A small
amount of “polluting” $Y_n$ introduced by roundoff error will quickly come to swamp
the desired $J_n$, according to equation (6.5.3).
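Because (6.5.6) and (6.5.7) are the same three-term recurrence, one small helper can serve both. The sketch below is illustrative only; the name `recur_up` and its argument order are assumptions, not routines from this book.

```c
/* Hypothetical helper: given f0 = F_0(x) and f1 = F_1(x), where F is either
   J or Y, apply F_{k+1} = (2k/x) F_k - F_{k-1} upward and return F_n(x).
   Assumes n >= 0 and x != 0. */
double recur_up(double f0, double f1, int n, double x)
{
    double fm = f0;     /* F_{k-1} */
    double f = f1;      /* F_k */
    double fp;          /* F_{k+1} */
    int k;

    if (n == 0) return f0;
    for (k = 1; k < n; k++) {
        fp = k * (2.0 / x) * f - fm;
        fm = f;
        f = fp;
    }
    return f;
}
```

Seeded with $Y_0$ and $Y_1$, the loop follows the growing solution and is stable for any n. Seeded with $J_0$ and $J_1$, it should be trusted only while n does not exceed x, for the reason given above.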

A practical strategy for computing the Bessel functions of integer order divides
into two tasks: first, how to compute $J_0$, $J_1$, $Y_0$, and $Y_1$, and second, how to use the
recurrence relations stably to find other J’s and Y’s. We treat the first task first:

For x between zero and some arbitrary value (we will use the value 8),
approximate $J_0(x)$ and $J_1(x)$ by rational functions in x. Likewise approximate by
rational functions the “regular part” of $Y_0(x)$ and $Y_1(x)$, defined as

$$Y_0(x) - \frac{2}{\pi} J_0(x)\ln(x) \qquad\text{and}\qquad Y_1(x) - \frac{2}{\pi}\left[J_1(x)\ln(x) - \frac{1}{x}\right] \tag{6.5.8}$$

For $8 < x < \infty$, use the approximating forms (n = 0, 1)

$$J_n(x) = \sqrt{\frac{2}{\pi x}}\left[P_n\left(\frac{8}{x}\right)\cos(X_n) - Q_n\left(\frac{8}{x}\right)\sin(X_n)\right] \tag{6.5.9}$$

$$Y_n(x) = \sqrt{\frac{2}{\pi x}}\left[P_n\left(\frac{8}{x}\right)\sin(X_n) + Q_n\left(\frac{8}{x}\right)\cos(X_n)\right] \tag{6.5.10}$$

where

$$X_n \equiv x - \frac{2n+1}{4}\pi \tag{6.5.11}$$

and where $P_0$, $P_1$, $Q_0$, and $Q_1$ are each polynomials in their arguments, for
$0 < 8/x < 1$. The P’s are even polynomials, the Q’s odd.

Coefficients of the various rational functions and polynomials are given by
Hart [1], for various levels of desired accuracy. A straightforward implementation is


#include <math.h>

float bessj0(float x)
/* Returns the Bessel function J0(x) for any real x. */
{
    float ax,z;
    double xx,y,ans,ans1,ans2;      /* Accumulate polynomials in double precision. */

    if ((ax=fabs(x)) < 8.0) {       /* Direct rational function fit. */
        y=x*x;
        ans1=57568490574.0+y*(-13362590354.0+y*(651619640.7
            +y*(-11214424.18+y*(77392.33017+y*(-184.9052456)))));
        ans2=57568490411.0+y*(1029532985.0+y*(9494680.718
            +y*(59272.64853+y*(267.8532712+y*1.0))));
        ans=ans1/ans2;
    } else {                        /* Fitting function (6.5.9). */
        z=8.0/ax;
        y=z*z;
        xx=ax-0.785398164;
        ans1=1.0+y*(-0.1098628627e-2+y*(0.2734510407e-4
            +y*(-0.2073370639e-5+y*0.2093887211e-6)));
        ans2 = -0.1562499995e-1+y*(0.1430488765e-3
            +y*(-0.6911147651e-5+y*(0.7621095161e-6
            -y*0.934945152e-7)));
        ans=sqrt(0.636619772/ax)*(cos(xx)*ans1-z*sin(xx)*ans2);
    }
    return ans;
}


#include <math.h>

float bessy0(float x)
/* Returns the Bessel function Y0(x) for positive x. */
{
    float bessj0(float x);
    float z;
    double xx,y,ans,ans1,ans2;      /* Accumulate polynomials in double precision. */

    if (x < 8.0) {                  /* Rational function approximation of (6.5.8). */
        y=x*x;
        ans1 = -2957821389.0+y*(7062834065.0+y*(-512359803.6
            +y*(10879881.29+y*(-86327.92757+y*228.4622733))));
        ans2=40076544269.0+y*(745249964.8+y*(7189466.438
            +y*(47447.26470+y*(226.1030244+y*1.0))));
        ans=(ans1/ans2)+0.636619772*bessj0(x)*log(x);
    } else {                        /* Fitting function (6.5.10). */
        z=8.0/x;
        y=z*z;
        xx=x-0.785398164;
        ans1=1.0+y*(-0.1098628627e-2+y*(0.2734510407e-4
            +y*(-0.2073370639e-5+y*0.2093887211e-6)));
        ans2 = -0.1562499995e-1+y*(0.1430488765e-3
            +y*(-0.6911147651e-5+y*(0.7621095161e-6
            +y*(-0.934945152e-7))));
        ans=sqrt(0.636619772/x)*(sin(xx)*ans1+z*cos(xx)*ans2);
    }
    return ans;
}


#include <math.h>

float bessj1(float x)
/* Returns the Bessel function J1(x) for any real x. */
{
    float ax,z;
    double xx,y,ans,ans1,ans2;      /* Accumulate polynomials in double precision. */

    if ((ax=fabs(x)) < 8.0) {       /* Direct rational approximation. */
        y=x*x;
        ans1=x*(72362614232.0+y*(-7895059235.0+y*(242396853.1
            +y*(-2972611.439+y*(15704.48260+y*(-30.16036606))))));
        ans2=144725228442.0+y*(2300535178.0+y*(18583304.74
            +y*(99447.43394+y*(376.9991397+y*1.0))));
        ans=ans1/ans2;
    } else {                        /* Fitting function (6.5.9). */
        z=8.0/ax;
        y=z*z;
        xx=ax-2.356194491;
        ans1=1.0+y*(0.183105e-2+y*(-0.3516396496e-4
            +y*(0.2457520174e-5+y*(-0.240337019e-6))));
        ans2=0.04687499995+y*(-0.2002690873e-3
            +y*(0.8449199096e-5+y*(-0.88228987e-6
            +y*0.105787412e-6)));
        ans=sqrt(0.636619772/ax)*(cos(xx)*ans1-z*sin(xx)*ans2);
        if (x < 0.0) ans = -ans;
    }
    return ans;
}


#include <math.h>

float bessy1(float x)
/* Returns the Bessel function Y1(x) for positive x. */
{
    float bessj1(float x);
    float z;
    double xx,y,ans,ans1,ans2;      /* Accumulate polynomials in double precision. */

    if (x < 8.0) {                  /* Rational function approximation of (6.5.8). */
        y=x*x;
        ans1=x*(-0.4900604943e13+y*(0.1275274390e13
            +y*(-0.5153438139e11+y*(0.7349264551e9
            +y*(-0.4237922726e7+y*0.8511937935e4)))));
        ans2=0.2499580570e14+y*(0.4244419664e12
            +y*(0.3733650367e10+y*(0.2245904002e8
            +y*(0.1020426050e6+y*(0.3549632885e3+y)))));
        ans=(ans1/ans2)+0.636619772*(bessj1(x)*log(x)-1.0/x);
    } else {                        /* Fitting function (6.5.10). */
        z=8.0/x;
        y=z*z;
        xx=x-2.356194491;
        ans1=1.0+y*(0.183105e-2+y*(-0.3516396496e-4
            +y*(0.2457520174e-5+y*(-0.240337019e-6))));
        ans2=0.04687499995+y*(-0.2002690873e-3
            +y*(0.8449199096e-5+y*(-0.88228987e-6
            +y*0.105787412e-6)));
        ans=sqrt(0.636619772/x)*(sin(xx)*ans1+z*cos(xx)*ans2);
    }
    return ans;
}

We now turn to the second task, namely how to use the recurrence formulas
(6.5.6) and (6.5.7) to get the Bessel functions $J_n(x)$ and $Y_n(x)$ for $n \ge 2$. The latter
of these is straightforward, since its upward recurrence is always stable:

float bessy(int n, float x)
/* Returns the Bessel function Yn(x) for positive x and n >= 2. */
{
    float bessy0(float x);
    float bessy1(float x);
    void nrerror(char error_text[]);
    int j;
    float by,bym,byp,tox;

    if (n < 2) nrerror("Index n less than 2 in bessy");
    tox=2.0/x;
    by=bessy1(x);                   /* Starting values for the recurrence. */
    bym=bessy0(x);
    for (j=1;j<n;j++) {             /* Recurrence (6.5.7). */
        byp=j*tox*by-bym;
        bym=by;
        by=byp;
    }
    return by;
}

The cost of this algorithm is the call to bessy1 and bessy0 (which generate a
call to each of bessj1 and bessj0), plus O(n) operations in the recurrence.

As for $J_n(x)$, things are a bit more complicated. We can start the recurrence
upward on n from $J_0$ and $J_1$, but it will remain stable only while n does not exceed
x. This is, however, just fine for calls with large x and small n, a case which
occurs frequently in practice.

The harder case to provide for is that with x < n. The best thing to do here
is to use Miller’s algorithm (see discussion preceding equation 5.5.16), applying
the recurrence downward from some arbitrary starting value and making use of the
upward-unstable nature of the recurrence to put us onto the correct solution. When
we finally arrive at $J_0$ or $J_1$ we are able to normalize the solution with the sum
(5.5.16) accumulated along the way.

The only subtlety is in deciding at how large an n we need start the downward
recurrence so as to obtain a desired accuracy by the time we reach the n that we
really want. If you play with the asymptotic forms (6.5.3) and (6.5.5), you should
be able to convince yourself that the answer is to start larger than the desired n by
an additive amount of order $[\text{constant} \times n]^{1/2}$, where the square root of the constant
is, very roughly, the number of significant figures of accuracy.

The above considerations lead to the following function. 


#include <math.h>
#define ACC 40.0                    /* Make larger to increase accuracy. */
#define BIGNO 1.0e10
#define BIGNI 1.0e-10

float bessj(int n, float x)
/* Returns the Bessel function Jn(x) for any real x and n >= 2. */
{
    float bessj0(float x);
    float bessj1(float x);
    void nrerror(char error_text[]);
    int j,jsum,m;
    float ax,bj,bjm,bjp,sum,tox,ans;

    if (n < 2) nrerror("Index n less than 2 in bessj");
    ax=fabs(x);
    if (ax == 0.0)
        return 0.0;
    else if (ax > (float) n) {      /* Upwards recurrence from J0 and J1. */
        tox=2.0/ax;
        bjm=bessj0(ax);
        bj=bessj1(ax);
        for (j=1;j<n;j++) {
            bjp=j*tox*bj-bjm;
            bjm=bj;
            bj=bjp;
        }
        ans=bj;
    } else {                        /* Downwards recurrence from an even m here computed. */
        tox=2.0/ax;
        m=2*((n+(int) sqrt(ACC*n))/2);
        jsum=0;                     /* jsum will alternate between 0 and 1; when it is 1, we
                                       accumulate in sum the even terms in (5.5.16). */
        bjp=ans=sum=0.0;
        bj=1.0;
        for (j=m;j>0;j--) {         /* The downward recurrence. */
            bjm=j*tox*bj-bjp;
            bjp=bj;
            bj=bjm;
            if (fabs(bj) > BIGNO) { /* Renormalize to prevent overflows. */
                bj *= BIGNI;
                bjp *= BIGNI;
                ans *= BIGNI;
                sum *= BIGNI;
            }
            if (jsum) sum += bj;    /* Accumulate the sum. */
            jsum=!jsum;             /* Change 0 to 1 or vice versa. */
            if (j == n) ans=bjp;    /* Save the unnormalized answer. */
        }
        sum=2.0*sum-bj;             /* Compute (5.5.16) */
        ans /= sum;                 /* and use it to normalize the answer. */
    }
    return x < 0.0 && (n & 1) ? -ans : ans;
}

CITED REFERENCES AND FURTHER READING:

Abramowitz, M., and Stegun, I.A. 1964, Handbook of Mathematical Functions, Applied Mathematics
Series, Volume 55 (Washington: National Bureau of Standards; reprinted 1968 by
Dover Publications, New York), Chapter 9.

Hart, J.F., et al. 1968, Computer Approximations (New York: Wiley), §6.8, p. 141. [1]


6.6 Modified Bessel Functions of Integer Order 

The modified Bessel functions $I_n(x)$ and $K_n(x)$ are equivalent to the usual
Bessel functions $J_n$ and $Y_n$ evaluated for purely imaginary arguments. In detail,
the relationship is

$$\begin{aligned}
I_n(x) &= (-i)^n J_n(ix) \\
K_n(x) &= \frac{\pi}{2}\, i^{n+1}\left[J_n(ix) + iY_n(ix)\right]
\end{aligned} \tag{6.6.1}$$

The particular choice of prefactor and of the linear combination of $J_n$ and $Y_n$ to form
$K_n$ are simply choices that make the functions real-valued for real arguments x.

For small arguments $x \ll n$, both $I_n(x)$ and $K_n(x)$ become, asymptotically,
simple powers of their argument

$$\begin{aligned}
I_n(x) &\approx \frac{1}{n!}\left(\frac{x}{2}\right)^{n} & n &\ge 0 \\
K_0(x) &\approx -\ln(x) \\
K_n(x) &\approx \frac{(n-1)!}{2}\left(\frac{x}{2}\right)^{-n} & n &> 0
\end{aligned} \tag{6.6.2}$$

These expressions are virtually identical to those for $J_n(x)$ and $Y_n(x)$ in this region,
except for the factor of $-2/\pi$ difference between $Y_n(x)$ and $K_n(x)$. In the region
$x \gg n$, however, the modified functions have quite different behavior than the
Bessel functions,

$$\begin{aligned}
I_n(x) &\approx \frac{1}{\sqrt{2\pi x}}\exp(x) \\
K_n(x) &\approx \frac{\pi}{\sqrt{2\pi x}}\exp(-x)
\end{aligned} \tag{6.6.3}$$

The modified functions evidently have exponential rather than sinusoidal behavior
for large arguments (see Figure 6.6.1). The smoothness of the modified
Bessel functions, once the exponential factor is removed, makes a simple polynomial
approximation of a few terms quite suitable for the functions $I_0$, $I_1$, $K_0$, and $K_1$.
The following routines, based on polynomial coefficients given by Abramowitz and
Stegun [1], evaluate these four functions, and will provide the basis for upward
recursion for n > 1 when x > n.

Figure 6.6.1. Modified Bessel functions $I_0(x)$ through $I_3(x)$, $K_0(x)$ through $K_2(x)$.


#include <math.h>

float bessi0(float x)
/* Returns the modified Bessel function I0(x) for any real x. */
{
    float ax,ans;
    double y;                       /* Accumulate polynomials in double precision. */

    if ((ax=fabs(x)) < 3.75) {      /* Polynomial fit. */
        y=x/3.75;
        y*=y;
        ans=1.0+y*(3.5156229+y*(3.0899424+y*(1.2067492
            +y*(0.2659732+y*(0.360768e-1+y*0.45813e-2)))));
    } else {
        y=3.75/ax;
        ans=(exp(ax)/sqrt(ax))*(0.39894228+y*(0.1328592e-1
            +y*(0.225319e-2+y*(-0.157565e-2+y*(0.916281e-2
            +y*(-0.2057706e-1+y*(0.2635537e-1+y*(-0.1647633e-1
            +y*0.392377e-2))))))));
    }
    return ans;
}

#include <math.h>

float bessk0(float x)
/* Returns the modified Bessel function K0(x) for positive real x. */
{
    float bessi0(float x);
    double y,ans;                   /* Accumulate polynomials in double precision. */

    if (x <= 2.0) {                 /* Polynomial fit. */
        y=x*x/4.0;
        ans=(-log(x/2.0)*bessi0(x))+(-0.57721566+y*(0.42278420
            +y*(0.23069756+y*(0.3488590e-1+y*(0.262698e-2
            +y*(0.10750e-3+y*0.74e-5))))));
    } else {
        y=2.0/x;
        ans=(exp(-x)/sqrt(x))*(1.25331414+y*(-0.7832358e-1
            +y*(0.2189568e-1+y*(-0.1062446e-1+y*(0.587872e-2
            +y*(-0.251540e-2+y*0.53208e-3))))));
    }
    return ans;
}

#include <math.h>

float bessi1(float x)
/* Returns the modified Bessel function I1(x) for any real x. */
{
    float ax,ans;
    double y;                       /* Accumulate polynomials in double precision. */

    if ((ax=fabs(x)) < 3.75) {      /* Polynomial fit. */
        y=x/3.75;
        y*=y;
        ans=ax*(0.5+y*(0.87890594+y*(0.51498869+y*(0.15084934
            +y*(0.2658733e-1+y*(0.301532e-2+y*0.32411e-3))))));
    } else {
        y=3.75/ax;
        ans=0.2282967e-1+y*(-0.2895312e-1+y*(0.1787654e-1
            -y*0.420059e-2));
        ans=0.39894228+y*(-0.3988024e-1+y*(-0.362018e-2
            +y*(0.163801e-2+y*(-0.1031555e-1+y*ans))));
        ans *= (exp(ax)/sqrt(ax));
    }
    return x < 0.0 ? -ans : ans;
}

#include <math.h>

float bessk1(float x)
/* Returns the modified Bessel function K1(x) for positive real x. */
{
    float bessi1(float x);
    double y,ans;                   /* Accumulate polynomials in double precision. */

    if (x <= 2.0) {                 /* Polynomial fit. */
        y=x*x/4.0;
        ans=(log(x/2.0)*bessi1(x))+(1.0/x)*(1.0+y*(0.15443144
            +y*(-0.67278579+y*(-0.18156897+y*(-0.1919402e-1
            +y*(-0.110404e-2+y*(-0.4686e-4)))))));
    } else {
        y=2.0/x;
        ans=(exp(-x)/sqrt(x))*(1.25331414+y*(0.23498619
            +y*(-0.3655620e-1+y*(0.1504268e-1+y*(-0.780353e-2
            +y*(0.325614e-2+y*(-0.68245e-3)))))));
    }
    return ans;
}

The recurrence relation for $I_n(x)$ and $K_n(x)$ is the same as that for $J_n(x)$
and $Y_n(x)$ provided that ix is substituted for x. This has the effect of changing
a sign in the relation,

$$\begin{aligned}
I_{n+1}(x) &= -\left(\frac{2n}{x}\right) I_n(x) + I_{n-1}(x) \\
K_{n+1}(x) &= +\left(\frac{2n}{x}\right) K_n(x) + K_{n-1}(x)
\end{aligned} \tag{6.6.4}$$

These relations are always unstable for upward recurrence. For $K_n$, itself growing,
this presents no problem. For $I_n$, however, the strategy of downward recursion is
therefore required once again, and the starting point for the recursion may be chosen
in the same manner as for the routine bessj. The only fundamental difference is
that the normalization formula for $I_n(x)$ has an alternating minus sign in successive
terms, which again arises from the substitution of ix for x in the formula used
previously for $J_n$

$$1 = I_0(x) - 2I_2(x) + 2I_4(x) - 2I_6(x) + \cdots \tag{6.6.5}$$

In fact, we prefer simply to normalize with a call to bessi0.

With this simple modification, the recursion routines bessj and bessy become 
the new routines bessi and bessk: 

float bessk(int n, float x)
/* Returns the modified Bessel function Kn(x) for positive x and n >= 2. */
{
    float bessk0(float x);
    float bessk1(float x);
    void nrerror(char error_text[]);
    int j;
    float bk,bkm,bkp,tox;

    if (n < 2) nrerror("Index n less than 2 in bessk");
    tox=2.0/x;
    bkm=bessk0(x);                  /* Upward recurrence for all x... */
    bk=bessk1(x);
    for (j=1;j<n;j++) {             /* ...and here it is. */
        bkp=bkm+j*tox*bk;
        bkm=bk;
        bk=bkp;
    }
    return bk;
}

#include <math.h>
#define ACC 40.0                    /* Make larger to increase accuracy. */
#define BIGNO 1.0e10
#define BIGNI 1.0e-10

float bessi(int n, float x)
/* Returns the modified Bessel function In(x) for any real x and n >= 2. */
{
    float bessi0(float x);
    void nrerror(char error_text[]);
    int j;
    float bi,bim,bip,tox,ans;

    if (n < 2) nrerror("Index n less than 2 in bessi");
    if (x == 0.0)
        return 0.0;
    else {
        tox=2.0/fabs(x);
        bip=ans=0.0;
        bi=1.0;
        for (j=2*(n+(int) sqrt(ACC*n));j>0;j--) {   /* Downward recurrence from even m. */
            bim=bip+j*tox*bi;
            bip=bi;
            bi=bim;
            if (fabs(bi) > BIGNO) {     /* Renormalize to prevent overflows. */
                ans *= BIGNI;
                bi *= BIGNI;
                bip *= BIGNI;
            }
            if (j == n) ans=bip;
        }
        ans *= bessi0(x)/bi;            /* Normalize with bessi0. */
        return x < 0.0 && (n & 1) ? -ans : ans;
    }
}


CITED REFERENCES AND FURTHER READING: 

Abramowitz, M., and Stegun, I.A. 1964, Handbook of Mathematical Functions, Applied Mathematics
Series, Volume 55 (Washington: National Bureau of Standards; reprinted 1968 by
Dover Publications, New York), §9.8. [1]

Carrier, G.F., Krook, M. and Pearson, C.E. 1966, Functions of a Complex Variable (New York: 
McGraw-Hill), pp. 220ff. 


6.7 Bessel Functions of Fractional Order, Airy 
Functions, Spherical Bessel Functions 

Many algorithms have been proposed for computing Bessel functions of fractional order 
numerically. Most of them are, in fact, not very good in practice. The routines given here are 
rather complicated, but they can be recommended wholeheartedly. 

Ordinary Bessel Functions 

The basic idea is Steed’s method, which was originally developed [1] for Coulomb wave
functions. The method calculates $J_\nu$, $J'_\nu$, $Y_\nu$, and $Y'_\nu$ simultaneously, and so involves four
relations among these functions. Three of the relations come from two continued fractions,
one of which is complex. The fourth is provided by the Wronskian relation

$$W \equiv J_\nu Y'_\nu - Y_\nu J'_\nu = \frac{2}{\pi x} \tag{6.7.1}$$

The first continued fraction, CF1, is defined by

$$f_\nu \equiv \frac{J'_\nu}{J_\nu} = \frac{\nu}{x} - \frac{J_{\nu+1}}{J_\nu}
= \frac{\nu}{x} - \cfrac{1}{2(\nu+1)/x - \cfrac{1}{2(\nu+2)/x - \cdots}} \tag{6.7.2}$$

You can easily derive it from the three-term recurrence relation for Bessel functions: Start with
equation (6.5.6) and use equation (5.5.18). Forward evaluation of the continued fraction by
one of the methods of §5.2 is essentially equivalent to backward recurrence of the recurrence
relation. The rate of convergence of CF1 is determined by the position of the turning point
$x_{tp} = \sqrt{\nu(\nu+1)} \approx \nu$, beyond which the Bessel functions become oscillatory. If $x \lesssim x_{tp}$,
convergence is very rapid. If $x \gtrsim x_{tp}$, then each iteration of the continued fraction effectively
increases $\nu$ by one until $x \lesssim x_{tp}$; thereafter rapid convergence sets in. Thus the number
of iterations of CF1 is of order x for large x. In the routine bessjy we set the maximum
allowed number of iterations to 10,000. For larger x, you can use the usual asymptotic
expressions for Bessel functions.

One can show that the sign of $J_\nu$ is the same as the sign of the denominator of CF1
once it has converged.
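The CF1 step can be tried out on its own. The following sketch evaluates (6.7.2) by the modified Lentz's method and tracks the sign of the denominator; the function name `cf1_ratio`, its interface, and the constants are assumptions made for this illustration, although the loop mirrors the corresponding part of bessjy. It assumes x > 0 and $\nu \ge 0$ and simply returns the last iterate if the iteration limit is reached.

```c
#include <math.h>

/* Hypothetical stand-alone CF1 evaluation: returns f_nu = J'_nu(x)/J_nu(x)
   from (6.7.2) by the modified Lentz's method. *isign receives the sign of
   the denominator, which the text identifies with the sign of J_nu. */
double cf1_ratio(double nu, double x, int *isign)
{
    const double EPS = 1.0e-12, FPMIN = 1.0e-30;    /* assumed tolerances */
    const int MAXIT = 10000;
    double xi2 = 2.0 / x;
    double h = nu / x;
    double b, c, d, del;
    int i;

    *isign = 1;
    if (h < FPMIN) h = FPMIN;
    b = xi2 * nu;
    d = 0.0;
    c = h;
    for (i = 1; i <= MAXIT; i++) {
        b += xi2;                       /* b = 2(nu+i)/x */
        d = b - d;
        if (fabs(d) < FPMIN) d = FPMIN;
        c = b - 1.0 / c;
        if (fabs(c) < FPMIN) c = FPMIN;
        d = 1.0 / d;
        del = c * d;
        h *= del;
        if (d < 0.0) *isign = -(*isign);
        if (fabs(del - 1.0) < EPS) break;
    }
    return h;
}
```

For $\nu = 1/2$ the result can be checked against the closed form $J'_{1/2}/J_{1/2} = \cot x - 1/(2x)$.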

The complex continued fraction CF2 is defined by

$$p + iq \equiv \frac{J'_\nu + iY'_\nu}{J_\nu + iY_\nu} = -\frac{1}{2x} + i + \frac{i}{x}\,\cfrac{(1/2)^2 - \nu^2}{2(x+i) + \cfrac{(3/2)^2 - \nu^2}{2(x+2i) + \cdots}} \tag{6.7.3}$$

(We sketch the derivation of CF2 in the analogous case of modified Bessel functions in the
next subsection.) This continued fraction converges rapidly for $x \gtrsim x_{tp}$, while convergence
fails as $x \to 0$. We have to adopt a special method for small x, which we describe below. For
x not too small, we can ensure that $x \gtrsim x_{tp}$ by a stable recurrence of $J_\nu$ and $J'_\nu$ downwards
to a value $\nu = \mu \lesssim x$, thus yielding the ratio $f_\mu$ at this lower value of $\nu$. This is the stable
direction for the recurrence relation. The initial values for the recurrence are

$$J_\nu = \text{arbitrary}, \qquad J'_\nu = f_\nu J_\nu \tag{6.7.4}$$

with the sign of the arbitrary initial value of $J_\nu$ chosen to be the sign of the denominator of
CF1. Choosing the initial value of $J_\nu$ very small minimizes the possibility of overflow during
the recurrence. The recurrence relations are

$$\begin{aligned}
J_{\nu-1} &= \frac{\nu}{x} J_\nu + J'_\nu \\
J'_{\nu-1} &= \frac{\nu-1}{x} J_{\nu-1} - J_\nu
\end{aligned} \tag{6.7.5}$$


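Once CF2 has been evaluated at $\nu = \mu$, then with the Wronskian (6.7.1) we have enough
relations to solve for all four quantities. The formulas are simplified by introducing the quantity

$$\gamma \equiv \frac{p - f_\mu}{q} \tag{6.7.6}$$

Then

$$J_\mu = \pm\left[\frac{W}{q + \gamma(p - f_\mu)}\right]^{1/2} \tag{6.7.7}$$

$$J'_\mu = f_\mu J_\mu \tag{6.7.8}$$

$$Y_\mu = \gamma J_\mu \tag{6.7.9}$$

$$Y'_\mu = Y_\mu\left(p + \frac{q}{\gamma}\right) \tag{6.7.10}$$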


The sign of $J_\mu$ in (6.7.7) is chosen to be the same as the sign of the initial $J_\nu$ in (6.7.4).

Once all four functions have been determined at the value $\nu = \mu$, we can find them at the
original value of $\nu$. For $J_\nu$ and $J'_\nu$, simply scale the values in (6.7.4) by the ratio of (6.7.7) to
the value found after applying the recurrence (6.7.5). The quantities $Y_\nu$ and $Y'_\nu$ can be found
by starting with the values in (6.7.9) and (6.7.10) and using the stable upwards recurrence

$$Y_{\nu+1} = \frac{2\nu}{x} Y_\nu - Y_{\nu-1} \tag{6.7.11}$$

together with the relation

$$Y'_\nu = \frac{\nu}{x} Y_\nu - Y_{\nu+1} \tag{6.7.12}$$

Now turn to the case of small x, when CF2 is not suitable. Temme [2] has given a
good method of evaluating $Y_\nu$ and $Y_{\nu+1}$, and hence $Y'_\nu$ from (6.7.12), by series expansions
that accurately handle the singularity as $x \to 0$. The expansions work only for $|\nu| \le 1/2$,
and so now the recurrence (6.7.5) is used to evaluate $f_\nu$ at a value $\nu = \mu$ in this interval.
Then one calculates $J_\mu$ from

$$J_\mu = \frac{W}{Y'_\mu - Y_\mu f_\mu} \tag{6.7.13}$$

and $J'_\mu$ from (6.7.8). The values at the original value of $\nu$ are determined by scaling as before,
and the Y’s are recurred up as before.

Temme’s series are

$$Y_\nu = -\sum_{k=0}^{\infty} c_k g_k \qquad Y_{\nu+1} = -\frac{2}{x}\sum_{k=0}^{\infty} c_k h_k \tag{6.7.14}$$

Here

$$c_k = \frac{(-x^2/4)^k}{k!} \tag{6.7.15}$$

while the coefficients $g_k$ and $h_k$ are defined in terms of quantities $p_k$, $q_k$, and $f_k$ that can
be found by recursion:

$$\begin{aligned}
g_k &= f_k + \frac{2}{\nu}\sin^2\left(\frac{\nu\pi}{2}\right) q_k \\
h_k &= -k g_k + p_k \\
p_k &= \frac{p_{k-1}}{k - \nu} \\
q_k &= \frac{q_{k-1}}{k + \nu} \\
f_k &= \frac{k f_{k-1} + p_{k-1} + q_{k-1}}{k^2 - \nu^2}
\end{aligned} \tag{6.7.16}$$

The initial values for the recurrences are

$$\begin{aligned}
p_0 &= \frac{1}{\pi}\left(\frac{x}{2}\right)^{-\nu}\Gamma(1+\nu) \\
q_0 &= \frac{1}{\pi}\left(\frac{x}{2}\right)^{\nu}\Gamma(1-\nu) \\
f_0 &= \frac{2}{\pi}\,\frac{\nu\pi}{\sin\nu\pi}\left[\cosh\sigma\,\Gamma_1(\nu) + \frac{\sinh\sigma}{\sigma}\ln\left(\frac{2}{x}\right)\Gamma_2(\nu)\right]
\end{aligned} \tag{6.7.17}$$

with

$$\begin{aligned}
\sigma &= \nu\ln\left(\frac{2}{x}\right) \\
\Gamma_1(\nu) &= \frac{1}{2\nu}\left[\frac{1}{\Gamma(1-\nu)} - \frac{1}{\Gamma(1+\nu)}\right] \\
\Gamma_2(\nu) &= \frac{1}{2}\left[\frac{1}{\Gamma(1-\nu)} + \frac{1}{\Gamma(1+\nu)}\right]
\end{aligned} \tag{6.7.18}$$


The whole point of writing the formulas in this way is that the potential problems as $\nu \to 0$
can be controlled by evaluating $\nu\pi/\sin\nu\pi$, $\sinh\sigma/\sigma$, and $\Gamma_1$ carefully. In particular, Temme
gives Chebyshev expansions for $\Gamma_1(\nu)$ and $\Gamma_2(\nu)$. We have rearranged his expansion for $\Gamma_1$
to be explicitly an even series in $\nu$ so that we can use our routine chebev as explained in §5.8.

The routine assumes $\nu \ge 0$. For negative $\nu$ you can use the reflection formulas

$$\begin{aligned}
J_{-\nu} &= \cos\nu\pi\, J_\nu - \sin\nu\pi\, Y_\nu \\
Y_{-\nu} &= \sin\nu\pi\, J_\nu + \cos\nu\pi\, Y_\nu
\end{aligned} \tag{6.7.19}$$


The routine also assumes x > 0. For x < 0 the functions are in general complex, but
expressible in terms of functions with x > 0. For x = 0, $Y_\nu$ is singular.

Internal arithmetic in the routine is carried out in double precision. The complex
arithmetic is carried out explicitly with real variables.

#include <math.h>
#include "nrutil.h"
#define EPS 1.0e-10
#define FPMIN 1.0e-30
#define MAXIT 10000
#define XMIN 2.0
#define PI 3.141592653589793

void bessjy(float x, float xnu, float *rj, float *ry, float *rjp, float *ryp)
/* Returns the Bessel functions rj = J_nu, ry = Y_nu and their derivatives rjp = J'_nu,
ryp = Y'_nu, for positive x and for xnu = nu >= 0. The relative accuracy is within one or
two significant digits of EPS, except near a zero of one of the functions, where EPS
controls its absolute accuracy. FPMIN is a number close to the machine's smallest
floating-point number. All internal arithmetic is in double precision. To convert the
entire routine to double precision, change the float declarations above to double and
decrease EPS to 1e-16. Also convert the function beschb. */
{
    void beschb(double x, double *gam1, double *gam2, double *gampl,
        double *gammi);
    int i,isign,l,nl;
    double a,b,br,bi,c,cr,ci,d,del,del1,den,di,dlr,dli,dr,e,f,fact,fact2,
        fact3,ff,gam,gam1,gam2,gammi,gampl,h,p,pimu,pimu2,q,r,rjl,
        rjl1,rjmu,rjp1,rjpl,rjtemp,ry1,rymu,rymup,rytemp,sum,sum1,
        temp,w,x2,xi,xi2,xmu,xmu2;

    if (x <= 0.0 || xnu < 0.0) nrerror("bad arguments in bessjy");
    nl=(x < XMIN ? (int)(xnu+0.5) : IMAX(0,(int)(xnu-x+1.5)));
    /* nl is the number of downward recurrences of the J's and upward recurrences of Y's.
       xmu lies between -1/2 and 1/2 for x < XMIN, while it is chosen so that x is greater
       than the turning point for x >= XMIN. */
    xmu=xnu-nl;
    xmu2=xmu*xmu;
    xi=1.0/x;
    xi2=2.0*xi;
    w=xi2/PI;                       /* The Wronskian. */
    isign=1;                        /* Evaluate CF1 by modified Lentz's method (§5.2).
                                       isign keeps track of sign changes in the denominator. */
    h=xnu*xi;
    if (h < FPMIN) h=FPMIN;
    b=xi2*xnu;
    d=0.0;
    c=h;
    for (i=1;i<=MAXIT;i++) {
        b += xi2;
        d=b-d;
        if (fabs(d) < FPMIN) d=FPMIN;
        c=b-1.0/c;
        if (fabs(c) < FPMIN) c=FPMIN;
        d=1.0/d;
        del=c*d;
        h=del*h;
        if (d < 0.0) isign = -isign;
        if (fabs(del-1.0) < EPS) break;
    }
    if (i > MAXIT) nrerror("x too large in bessjy; try asymptotic expansion");
    rjl=isign*FPMIN;                /* Initialize J_nu and J'_nu for downward recurrence. */
    rjpl=h*rjl;
    rjl1=rjl;                       /* Store values for later rescaling. */
    rjp1=rjpl;
    fact=xnu*xi;
    for (l=nl;l>=1;l--) {
        rjtemp=fact*rjl+rjpl;
        fact -= xi;
        rjpl=fact*rjtemp-rjl;
        rjl=rjtemp;
    }

    if (rjl == 0.0) rjl=EPS;
    f=rjpl/rjl;                     /* Now have unnormalized J_mu and J'_mu. */
    if (x < XMIN) {                 /* Use series. */
        x2=0.5*x;
        pimu=PI*xmu;
        fact = (fabs(pimu) < EPS ? 1.0 : pimu/sin(pimu));
        d = -log(x2);
        e=xmu*d;
        fact2 = (fabs(e) < EPS ? 1.0 : sinh(e)/e);
        beschb(xmu,&gam1,&gam2,&gampl,&gammi);      /* Chebyshev evaluation of Gamma_1 and Gamma_2. */
        ff=2.0/PI*fact*(gam1*cosh(e)+gam2*fact2*d); /* f_0. */
        e=exp(e);
        p=e/(gampl*PI);             /* p_0. */
        q=1.0/(e*PI*gammi);         /* q_0. */
        pimu2=0.5*pimu;
        fact3 = (fabs(pimu2) < EPS ? 1.0 : sin(pimu2)/pimu2);
        r=PI*pimu2*fact3*fact3;
        c=1.0;
        d = -x2*x2;
        sum=ff+r*q;
        sum1=p;
        for (i=1;i<=MAXIT;i++) {
            ff=(i*ff+p+q)/(i*i-xmu2);
            c *= (d/i);
            p /= (i-xmu);
            q /= (i+xmu);
            del=c*(ff+r*q);
            sum += del;
            del1=c*p-i*del;
            sum1 += del1;
            if (fabs(del) < (1.0+fabs(sum))*EPS) break;
        }
        if (i > MAXIT) nrerror("bessy series failed to converge");
        rymu = -sum;
        ry1 = -sum1*xi2;
        rymup=xmu*xi*rymu-ry1;
        rjmu=w/(rymup-f*rymu);      /* Equation (6.7.13). */
    } else {                        /* Evaluate CF2 by modified Lentz's method (§5.2). */

        a=0.25-xmu2;
        p = -0.5*xi;
        q=1.0;
        br=2.0*x;
        bi=2.0;
        fact=a*xi/(p*p+q*q);
        cr=br+q*fact;
        ci=bi+p*fact;
        den=br*br+bi*bi;
        dr=br/den;
        di = -bi/den;
        dlr=cr*dr-ci*di;
        dli=cr*di+ci*dr;
        temp=p*dlr-q*dli;
        q=p*dli+q*dlr;
        p=temp;
        for (i=2;i<=MAXIT;i++) {
            a += 2*(i-1);
            bi += 2.0;
            dr=a*dr+br;
            di=a*di+bi;
            if (fabs(dr)+fabs(di) < FPMIN) dr=FPMIN;
            fact=a/(cr*cr+ci*ci);
            cr=br+cr*fact;
            ci=bi-ci*fact;
            if (fabs(cr)+fabs(ci) < FPMIN) cr=FPMIN;
            den=dr*dr+di*di;
            dr /= den;
            di /= -den;
            dlr=cr*dr-ci*di;
            dli=cr*di+ci*dr;
            temp=p*dlr-q*dli;
            q=p*dli+q*dlr;
            p=temp;
            if (fabs(dlr-1.0)+fabs(dli) < EPS) break;
        }
        if (i > MAXIT) nrerror("cf2 failed in bessjy");
        gam=(p-f)/q;                /* Equations (6.7.6) - (6.7.10). */
        rjmu=sqrt(w/((p-f)*gam+q));
        rjmu=SIGN(rjmu,rjl);
        rymu=rjmu*gam;
        rymup=rymu*(p+q/gam);
        ry1=xmu*xi*rymu-rymup;
    }

    fact=rjmu/rjl;
    *rj=rjl1*fact;                  /* Scale original J_nu and J'_nu. */
    *rjp=rjp1*fact;
    for (i=1;i<=nl;i++) {           /* Upward recurrence of Y_nu. */
        rytemp=(xmu+i)*xi2*ry1-rymu;
        rymu=ry1;
        ry1=rytemp;
    }
    *ry=rymu;
    *ryp=xnu*xi*rymu-ry1;
}

#define NUSE1 5
#define NUSE2 5

void beschb(double x, double *gam1, double *gam2, double *gampl, double *gammi)
/* Evaluates Gamma_1 and Gamma_2 by Chebyshev expansion for |x| <= 1/2. Also returns
1/Gamma(1 + x) and 1/Gamma(1 - x). If converting to double precision, set NUSE1 = 7,
NUSE2 = 8. */
{
    float chebev(float a, float b, float c[], int m, float x);
    float xx;
    static float c1[] = {
        -1.142022680371168e0,6.5165112670737e-3,
        3.087090173086e-4,-3.4706269649e-6,6.9437664e-9,
        3.67795e-11,-1.356e-13};
    static float c2[] = {
        1.843740587300905e0,-7.68528408447867e-2,
        1.2719271366546e-3,-4.9717367042e-6,-3.31261198e-8,
        2.423096e-10,-1.702e-13,-1.49e-15};

    xx=8.0*x*x-1.0;                 /* Multiply x by 2 to make range be -1 to 1, and then
                                       apply transformation for evaluating even Chebyshev
                                       series. */
    *gam1=chebev(-1.0,1.0,c1,NUSE1,xx);
    *gam2=chebev(-1.0,1.0,c2,NUSE2,xx);
    *gampl= *gam2-x*(*gam1);
    *gammi= *gam2+x*(*gam1);
}

Modified Bessel Functions

Steed’s method does not work for modified Bessel functions because in this case CF2 is
purely imaginary and we have only three relations among the four functions. Temme [3] has
given a normalization condition that provides the fourth relation.

The Wronskian relation is

$$W \equiv I_\nu K'_\nu - K_\nu I'_\nu = -\frac{1}{x} \tag{6.7.20}$$

The continued fraction CF1 becomes

$$f_\nu \equiv \frac{I'_\nu}{I_\nu} = \frac{\nu}{x} + \cfrac{1}{2(\nu+1)/x + \cfrac{1}{2(\nu+2)/x + \cdots}} \tag{6.7.21}$$
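As with the ordinary Bessel functions, this CF1 can be evaluated on its own. The sketch below uses the modified Lentz's method with all plus signs, so no zero denominators can arise for x > 0 and $\nu \ge 0$. The name `cf1_ratio_mod` and the tolerance values are assumptions made for this illustration.

```c
#include <math.h>

/* Hypothetical stand-alone evaluation of (6.7.21): returns
   f_nu = I'_nu(x)/I_nu(x) for x > 0 and nu >= 0. */
double cf1_ratio_mod(double nu, double x)
{
    const double EPS = 1.0e-12, FPMIN = 1.0e-30;    /* assumed tolerances */
    const int MAXIT = 10000;
    double xi2 = 2.0 / x;
    double h = nu / x;
    double b, c, d, del;
    int i;

    if (h < FPMIN) h = FPMIN;
    b = xi2 * nu;
    d = 0.0;
    c = h;
    for (i = 1; i <= MAXIT; i++) {
        b += xi2;                   /* b = 2(nu+i)/x */
        d = 1.0 / (b + d);          /* all terms positive: no zero denominators */
        c = b + 1.0 / c;
        del = c * d;
        h *= del;
        if (fabs(del - 1.0) < EPS) break;
    }
    return h;
}
```

For $\nu = 1/2$ the result can be compared with the closed form $I'_{1/2}/I_{1/2} = \coth x - 1/(2x)$.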

To get CF2 and the normalization condition in a convenient form, consider the sequence
of confluent hypergeometric functions

$$z_n(x) = U(\nu + 1/2 + n,\; 2\nu + 1,\; 2x) \tag{6.7.22}$$

for fixed $\nu$. Then

$$K_\nu(x) = \pi^{1/2} (2x)^{\nu} e^{-x} z_0(x) \tag{6.7.23}$$

$$\frac{K_{\nu+1}(x)}{K_\nu(x)} = \frac{1}{x}\left[\nu + \frac{1}{2} + x + \left(\nu^2 - \frac{1}{4}\right)\frac{z_1}{z_0}\right] \tag{6.7.24}$$

Equation (6.7.23) is the standard expression for $K_\nu$ in terms of a confluent hypergeometric
function, while equation (6.7.24) follows from relations between contiguous confluent
hypergeometric functions (equations 13.4.16 and 13.4.18 in Abramowitz and Stegun). Now
the functions $z_n$ satisfy the three-term recurrence relation (equation 13.4.15 in Abramowitz
and Stegun)

$$z_{n-1}(x) = b_n z_n(x) + a_{n+1} z_{n+1} \tag{6.7.25}$$

with

$$\begin{aligned}
b_n &= 2(n + x) \\
a_{n+1} &= -\left[(n + 1/2)^2 - \nu^2\right]
\end{aligned} \tag{6.7.26}$$

Following the steps leading to equation (5.5.18), we get the continued fraction CF2

$$\frac{z_1}{z_0} = \cfrac{1}{b_1 + \cfrac{a_2}{b_2 + \cdots}} \tag{6.7.27}$$

from which (6.7.24) gives $K_{\nu+1}/K_\nu$ and thus $K'_\nu/K_\nu$.

Temme’s normalization condition is that

$$\sum_{n=0}^{\infty} C_n z_n = \left(\frac{1}{2x}\right)^{\nu+1/2} \tag{6.7.28}$$

where

$$C_n = \frac{(-1)^n}{n!}\,\frac{\Gamma(\nu + 1/2 + n)}{\Gamma(\nu + 1/2 - n)} \tag{6.7.29}$$

Note that the $C_n$’s can be determined by recursion:

$$C_0 = 1, \qquad C_{n+1} = -\frac{a_{n+1}}{n+1}\, C_n \tag{6.7.30}$$

We use the condition (6.7.28) by finding

$$S = \sum_{n=1}^{\infty} C_n \frac{z_n}{z_0} \tag{6.7.31}$$

Then

$$z_0 = \left(\frac{1}{2x}\right)^{\nu+1/2} \frac{1}{1 + S} \tag{6.7.32}$$

and (6.7.23) gives $K_\nu$.

Thompson and Barnett [4] have given a clever method of doing the sum (6.7.31)
simultaneously with the forward evaluation of the continued fraction CF2. Suppose the
continued fraction is being evaluated as

$$\frac{z_1}{z_0} = \sum_{n=0}^{\infty} \Delta h_n \tag{6.7.33}$$

where the increments $\Delta h_n$ are being found by, e.g., Steed’s algorithm or the modified Lentz’s
algorithm of §5.2. Then the approximation to S keeping the first N terms can be found as

$$S_N = \sum_{n=1}^{N} Q_n \Delta h_n \tag{6.7.34}$$

Here

$$Q_n = \sum_{k=1}^{n} C_k q_k \tag{6.7.35}$$

and $q_k$ is found by recursion from

$$q_{k+1} = (q_{k-1} - b_k q_k)/a_{k+1} \tag{6.7.36}$$

starting with $q_0 = 0$, $q_1 = 1$. For the case at hand, approximately three times as many terms
are needed to get S to converge as are needed simply for CF2 to converge.

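To find $K_\nu$ and $K_{\nu+1}$ for small x we use series analogous to (6.7.14):

$$K_\nu = \sum_{k=0}^{\infty} c_k f_k \qquad K_{\nu+1} = \frac{2}{x}\sum_{k=0}^{\infty} c_k h_k \tag{6.7.37}$$

Here

$$\begin{aligned}
c_k &= \frac{(x^2/4)^k}{k!} \\
h_k &= -k f_k + p_k \\
p_k &= \frac{p_{k-1}}{k - \nu} \\
q_k &= \frac{q_{k-1}}{k + \nu} \\
f_k &= \frac{k f_{k-1} + p_{k-1} + q_{k-1}}{k^2 - \nu^2}
\end{aligned} \tag{6.7.38}$$

The initial values for the recurrences are

$$\begin{aligned}
p_0 &= \frac{1}{2}\left(\frac{x}{2}\right)^{-\nu}\Gamma(1+\nu) \\
q_0 &= \frac{1}{2}\left(\frac{x}{2}\right)^{\nu}\Gamma(1-\nu) \\
f_0 &= \frac{\nu\pi}{\sin\nu\pi}\left[\cosh\sigma\,\Gamma_1(\nu) + \frac{\sinh\sigma}{\sigma}\ln\left(\frac{2}{x}\right)\Gamma_2(\nu)\right]
\end{aligned} \tag{6.7.39}$$

Both the series for small x, and CF2 and the normalization relation (6.7.28) require
$|\nu| \le 1/2$. In both cases, therefore, we recurse $I_\nu$ down to a value $\nu = \mu$ in this interval, find
$K_\mu$ there, and recurse $K_\nu$ back up to the original value of $\nu$.

The routine assumes $\nu \ge 0$. For negative $\nu$ use the reflection formulas

$$\begin{aligned}
I_{-\nu} &= I_\nu + \frac{2}{\pi}\sin(\nu\pi)\, K_\nu \\
K_{-\nu} &= K_\nu
\end{aligned} \tag{6.7.40}$$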


Note that for large x, $I_\nu \sim e^{x}$, $K_\nu \sim e^{-x}$, and so these functions will overflow or
underflow. It is often desirable to be able to compute the scaled quantities $e^{-x} I_\nu$ and $e^{x} K_\nu$.
Simply omitting the factor $e^{-x}$ in equation (6.7.23) will ensure that all four quantities will
have the appropriate scaling. If you also want to scale the four quantities for small x when
the series in equation (6.7.37) are used, you must multiply each series by $e^{x}$.

#include <math.h>
#define EPS 1.0e-10
#define FPMIN 1.0e-30
#define MAXIT 10000
#define XMIN 2.0
#define PI 3.141592653589793

void bessik(float x, float xnu, float *ri, float *rk, float *rip, float *rkp)
/* Returns the modified Bessel functions ri = I_nu, rk = K_nu and their derivatives
rip = I'_nu, rkp = K'_nu, for positive x and for xnu = nu >= 0. The relative accuracy is
within one or two significant digits of EPS. FPMIN is a number close to the machine's
smallest floating-point number. All internal arithmetic is in double precision. To convert
the entire routine to double precision, change the float declarations above to double and
decrease EPS to 1e-16. Also convert the function beschb. */
{
    void beschb(double x, double *gam1, double *gam2, double *gampl,
        double *gammi);
    void nrerror(char error_text[]);
    int i,l,nl;
    double a,a1,b,c,d,del,del1,delh,dels,e,f,fact,fact2,ff,gam1,gam2,
        gammi,gampl,h,p,pimu,q,q1,q2,qnew,ril,ril1,rimu,rip1,ripl,
        ritemp,rk1,rkmu,rkmup,rktemp,s,sum,sum1,x2,xi,xi2,xmu,xmu2;

    if (x <= 0.0 || xnu < 0.0) nrerror("bad arguments in bessik");
    nl=(int)(xnu+0.5);              /* nl is the number of downward recurrences of the I's
                                       and upward recurrences of K's. xmu lies between
                                       -1/2 and 1/2. */
    xmu=xnu-nl;
    xmu2=xmu*xmu;
    xi=1.0/x;
    xi2=2.0*xi;
    h=xnu*xi;                       /* Evaluate CF1 by modified Lentz's method (§5.2). */
    if (h < FPMIN) h=FPMIN;
    b=xi2*xnu;
    d=0.0;
    c=h;
    for (i=1;i<=MAXIT;i++) {
        b += xi2;
        d=1.0/(b+d);                /* Denominators cannot be zero here, so no need for
                                       special precautions. */
        c=b+1.0/c;
        del=c*d;
        h=del*h;
        if (fabs(del-1.0) < EPS) break;
    }
    if (i > MAXIT) nrerror("x too large in bessik; try asymptotic expansion");
    ril=FPMIN;                      /* Initialize I_nu and I'_nu for downward recurrence. */
    ripl=h*ril;
    ril1=ril;                       /* Store values for later rescaling. */
    rip1=ripl;
    fact=xnu*xi;
    for (l=nl;l>=1;l--) {
        ritemp=fact*ril+ripl;
        fact -= xi;
        ripl=fact*ritemp+ril;
        ril=ritemp;
    }

f=ripl/ril; Now have unnormalized and 1'^. 

if (x < XMIN) { Use series. 

x2=0.5*x; 
pimu=PI*xmu; 

fact = (fabs(pimu) < EPS ? 1.0 : pimu/sin(pimu)); 
d = -log(x2); 
e=xmu*d; 

fact2 = (fabs(e) < EPS ? 1.0 : sinh(e)/e); 

beschb (xmu, &gaml,&gam2,&gampl,&gammi); Chebyshev evaluation of Ti and TV 

ff=fact*(gaml*cosh(e)+gam2*fact2*d); /q. 
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sum=ff; 
e=exp(e); 

p=0.5*e/gampl; po- 

q=0.5/(e*gammi); q o. 

c=l.0; 
d=x2*x2; 
suml=p; 

for (i=l;i<=MAXIT;i++) { 

ff=(i*ff+p+q)/(i*i-xmu2); 
c *= (d/i); 
p /= (i-xrni); 
q /= (i+xmu) ; 
del=c*ff; 
sum += del; 
dell=c*(p-i*ff); 
suml += dell; 

if (fabs(del) < fabs(sum)*EPS) break; 

> 

if (i > MAXIT) nrerrorC'bessk series failed to converge"); 
rkmu=sum; 
rkl=suml*xi2; 

slse { Evaluate CF2 by Steed's algorithm 

b=2.0*(l .0+x) ; (§5.2), which is OK because there 

d=1.0/b; can be no zero denominators. 

h=delh=d; 

ql=0.0; Initializations for recurrence (6.7.35). 

q2=l.0; 
al=0.25-xmu2; 

q=c=al; First term in equatioi 

a = -al; 
s=1.0+q*delh; 
for (i=2;i<=MAXIT;i++) { 
a -= 2*(i—1); 
c = -a*c/i; 
qnew=(ql-b*q2)/a; 
ql=q2; 
q2=qnew; 


rst term in equation (6.7.34). 


b += 2.0; 

d=l.0/(b+a*d); 

delh=(b*d-l.0)*delh; 

h += delh; 

dels=q*delh; 

s += dels; 

if (fabs(dels/s) < EPS) break; 

Need only test convergence of sum since CF2 itself converges more quickly. 


if (i > MAXIT) nrerror("bessik: failure to converge in cf2"); 
h=al*h; 

rkmu=sqrt(PI/(2.0*x))*exp(-x)/s; Omit the factor exp(- 

rkl=rkmu*(xmu+x+0.5-h)*xi; all the returned func 


rkmup=xmu*xi*rkmu-rkl; 
rimu=xi/(f*rkmu-rkmup); 

*ri=(rimu*rill)/ril; 

*rip=(rimu*ripl)/ril; 
for (i=l;i<=nl;i++) { 

rktemp=(xmu+i)*xi2*rkl+rkmu; 

rkmu=rkl; 

rkl=rktemp; 


Omit the factor exp(— x) to scale 
all the returned functions by exp(a;) 
for x > XMIN. 


Get from Wronskian. 
Scale original I„ and I' v . 

Upward recurrence of K v 


*rkp=xnu*xi*rkmu-rk 
Airy Functions


For positive x, the Airy functions are defined by

$$\mathrm{Ai}(x) = \frac{1}{\pi}\sqrt{\frac{x}{3}}\,K_{1/3}(z) \tag{6.7.41}$$

$$\mathrm{Bi}(x) = \sqrt{\frac{x}{3}}\left[I_{1/3}(z) + I_{-1/3}(z)\right] \tag{6.7.42}$$

where

$$z = \frac{2}{3}\,x^{3/2} \tag{6.7.43}$$

By using the reflection formula (6.7.40), we can convert (6.7.42) into the computationally 
more useful form 

$$\mathrm{Bi}(x) = \sqrt{x}\left[\frac{2}{\sqrt{3}}\,I_{1/3}(z) + \frac{1}{\pi}\,K_{1/3}(z)\right] \tag{6.7.44}$$

so that Ai and Bi can be evaluated with a single call to bessik. 

The derivatives should not be evaluated by simply differentiating the above expressions 
because of possible subtraction errors near x = 0. Instead, use the equivalent expressions 


$$\mathrm{Ai}'(x) = -\frac{x}{\pi\sqrt{3}}\,K_{2/3}(z)$$

$$\mathrm{Bi}'(x) = x\left[\frac{2}{\sqrt{3}}\,I_{2/3}(z) + \frac{1}{\pi}\,K_{2/3}(z)\right] \tag{6.7.45}$$


The corresponding formulas for negative arguments are 

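$$\mathrm{Ai}(-x) = \frac{\sqrt{x}}{2}\left[J_{1/3}(z) - \frac{1}{\sqrt{3}}\,Y_{1/3}(z)\right]$$

$$\mathrm{Bi}(-x) = -\frac{\sqrt{x}}{2}\left[\frac{1}{\sqrt{3}}\,J_{1/3}(z) + Y_{1/3}(z)\right]$$

$$\mathrm{Ai}'(-x) = \frac{x}{2}\left[J_{2/3}(z) + \frac{1}{\sqrt{3}}\,Y_{2/3}(z)\right]$$

$$\mathrm{Bi}'(-x) = \frac{x}{2}\left[\frac{1}{\sqrt{3}}\,J_{2/3}(z) - Y_{2/3}(z)\right] \tag{6.7.46}$$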


#include <math.h>
#define PI 3.1415927
#define THIRD (1.0/3.0)
#define TWOTHR (2.0*THIRD)
#define ONOVRT 0.57735027

void airy(float x, float *ai, float *bi, float *aip, float *bip)
/* Returns Airy functions Ai(x), Bi(x), and their derivatives Ai'(x), Bi'(x). */
{
    void bessik(float x, float xnu, float *ri, float *rk, float *rip,
        float *rkp);
    void bessjy(float x, float xnu, float *rj, float *ry, float *rjp,
        float *ryp);
    float absx,ri,rip,rj,rjp,rk,rkp,rootx,ry,ryp,z;

    absx=fabs(x);
    rootx=sqrt(absx);
    z=TWOTHR*absx*rootx;
    if (x > 0.0) {
        bessik(z,THIRD,&ri,&rk,&rip,&rkp);
        *ai=rootx*ONOVRT*rk/PI;
        *bi=rootx*(rk/PI+2.0*ONOVRT*ri);
        bessik(z,TWOTHR,&ri,&rk,&rip,&rkp);
        *aip = -x*ONOVRT*rk/PI;
        *bip=x*(rk/PI+2.0*ONOVRT*ri);
    } else if (x < 0.0) {
        bessjy(z,THIRD,&rj,&ry,&rjp,&ryp);
        *ai=0.5*rootx*(rj-ONOVRT*ry);
        *bi = -0.5*rootx*(ry+ONOVRT*rj);
        bessjy(z,TWOTHR,&rj,&ry,&rjp,&ryp);
        *aip=0.5*absx*(ONOVRT*ry+rj);
        *bip=0.5*absx*(ONOVRT*rj-ry);
    } else {                    /* Case x = 0. */
        *ai=0.35502805;
        *bi=(*ai)/ONOVRT;
        *aip = -0.25881940;
        *bip = -(*aip)/ONOVRT;
    }
}


Spherical Bessel Functions 


For integer n, spherical Bessel functions are defined by 
$$j_n(x) = \sqrt{\frac{\pi}{2x}}\,J_{n+(1/2)}(x), \qquad y_n(x) = \sqrt{\frac{\pi}{2x}}\,Y_{n+(1/2)}(x) \tag{6.7.47}$$


They can be evaluated by a call to bessjy, and the derivatives can safely be found from 
the derivatives of equation (6.7.47). 

Note that in the continued fraction CF2 in (6.7.3) just the first term survives for $\nu = 1/2$.
Thus one can make a very simple algorithm for spherical Bessel functions along the lines of
bessjy by always recursing $j_n$ down to n = 0, setting p and q from the first term in CF2, and
then recursing $y_n$ up. No special series is required near x = 0. However, bessjy is already
so efficient that we have not bothered to provide an independent routine for spherical Bessels.


#include <math.h>
#define RTPIO2 1.2533141

void sphbes(int n, float x, float *sj, float *sy, float *sjp, float *syp)
/* Returns spherical Bessel functions j_n(x), y_n(x), and their derivatives j'_n(x),
   y'_n(x) for integer n. */
{
    void bessjy(float x, float xnu, float *rj, float *ry, float *rjp,
        float *ryp);
    void nrerror(char error_text[]);
    float factor,order,rj,rjp,ry,ryp;

    if (n < 0 || x <= 0.0) nrerror("bad arguments in sphbes");
    order=n+0.5;
    bessjy(x,order,&rj,&ry,&rjp,&ryp);
    factor=RTPIO2/sqrt(x);
    *sj=factor*rj;
    *sy=factor*ry;
    *sjp=factor*rjp-(*sj)/(2.0*x);
    *syp=factor*ryp-(*sy)/(2.0*x);
}
6.8 Spherical Harmonics

Spherical harmonics occur in a large variety of physical problems, for example,
whenever a wave equation, or Laplace's equation, is solved by separation
of variables in spherical coordinates. The spherical harmonic $Y_{lm}(\theta,\phi)$,
$-l \le m \le l$, is a function of the two coordinates $\theta$, $\phi$ on the surface of a sphere.

The spherical harmonics are orthogonal for different l and m, and they are
normalized so that their integrated square over the sphere is unity:

$$\int_0^{2\pi} d\phi \int_{-1}^{1} d(\cos\theta)\, Y^*_{l'm'}(\theta,\phi)\, Y_{lm}(\theta,\phi) = \delta_{l'l}\,\delta_{m'm} \tag{6.8.1}$$

Here asterisk denotes complex conjugation.

Mathematically, the spherical harmonics are related to associated Legendre
polynomials by the equation

$$Y_{lm}(\theta,\phi) = \sqrt{\frac{2l+1}{4\pi}\,\frac{(l-m)!}{(l+m)!}}\; P_l^m(\cos\theta)\, e^{im\phi} \tag{6.8.2}$$

By using the relation

$$Y_{l,-m}(\theta,\phi) = (-1)^m\, Y^*_{lm}(\theta,\phi) \tag{6.8.3}$$

we can always relate a spherical harmonic to an associated Legendre polynomial
with $m \ge 0$. With $x = \cos\theta$, these are defined in terms of the ordinary Legendre
polynomials (cf. §4.5 and §5.5) by

$$P_l^m(x) = (-1)^m (1-x^2)^{m/2}\, \frac{d^m}{dx^m} P_l(x) \tag{6.8.4}$$
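The first few associated Legendre polynomials, and their corresponding normalized
spherical harmonics, are

$$\begin{aligned}
P_0^0(x) &= 1 & Y_{00} &= \sqrt{\frac{1}{4\pi}} \\
P_1^1(x) &= -(1-x^2)^{1/2} & Y_{11} &= -\sqrt{\frac{3}{8\pi}}\,\sin\theta\, e^{i\phi} \\
P_1^0(x) &= x & Y_{10} &= \sqrt{\frac{3}{4\pi}}\,\cos\theta \\
P_2^2(x) &= 3(1-x^2) & Y_{22} &= \frac{1}{4}\sqrt{\frac{15}{2\pi}}\,\sin^2\theta\, e^{2i\phi} \\
P_2^1(x) &= -3(1-x^2)^{1/2}\,x & Y_{21} &= -\sqrt{\frac{15}{8\pi}}\,\sin\theta\cos\theta\, e^{i\phi} \\
P_2^0(x) &= \tfrac{1}{2}(3x^2-1) & Y_{20} &= \sqrt{\frac{5}{4\pi}}\left(\tfrac{3}{2}\cos^2\theta - \tfrac{1}{2}\right)
\end{aligned} \tag{6.8.5}$$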


There are many bad ways to evaluate associated Legendre polynomials numerically.
For example, there are explicit expressions, such as

$$P_l^m(x) = \frac{(-1)^m (l+m)!}{2^m\, m!\,(l-m)!}\,(1-x^2)^{m/2}\left[1 - \frac{(l-m)(m+l+1)}{1!\,(m+1)}\left(\frac{1-x}{2}\right) + \frac{(l-m)(l-m-1)(m+l+1)(m+l+2)}{2!\,(m+1)(m+2)}\left(\frac{1-x}{2}\right)^2 - \cdots\right] \tag{6.8.6}$$

where the polynomial continues up through the term in $(1-x)^{l-m}$. (See [1] for
this and related formulas.) This is not a satisfactory method because evaluation
of the polynomial involves delicate cancellations between successive terms, which
alternate in sign. For large l, the individual terms in the polynomial become very
much larger than their sum, and all accuracy is lost.

In practice, (6.8.6) can be used only in single precision (32-bit) for l up
to 6 or 8, and in double precision (64-bit) for l up to 15 or 18, depending on
the precision required for the answer. A more robust computational procedure is
therefore desirable, as follows:

The associated Legendre functions satisfy numerous recurrence relations, tabulated
in [1-2]. These are recurrences on l alone, on m alone, and on both l
and m simultaneously. Most of the recurrences involving m are unstable, and so
dangerous for numerical work. The following recurrence on l is, however, stable
(compare 5.5.1):

$$(l-m)\,P_l^m = x(2l-1)\,P_{l-1}^m - (l+m-1)\,P_{l-2}^m \tag{6.8.7}$$

It is useful because there is a closed-form expression for the starting value,

$$P_m^m = (-1)^m (2m-1)!!\,(1-x^2)^{m/2} \tag{6.8.8}$$

(The notation n!! denotes the product of all odd integers less than or equal to n.)
Using (6.8.7) with l = m + 1, and setting $P_{m-1}^m = 0$, we find

$$P_{m+1}^m = x(2m+1)\,P_m^m \tag{6.8.9}$$

Equations (6.8.8) and (6.8.9) provide the two starting values required for (6.8.7)
for general l.

The function that implements this is
#include <math.h>

float plgndr(int l, int m, float x)
/* Computes the associated Legendre polynomial P_l^m(x). Here m and l are integers
   satisfying 0 <= m <= l, while x lies in the range -1 <= x <= 1. */
{
    void nrerror(char error_text[]);
    float fact,pll,pmm,pmmp1,somx2;
    int i,ll;

    if (m < 0 || m > l || fabs(x) > 1.0)
        nrerror("Bad arguments in routine plgndr");
    pmm=1.0;                    /* Compute P_m^m. */
    if (m > 0) {
        somx2=sqrt((1.0-x)*(1.0+x));
        fact=1.0;
        for (i=1;i<=m;i++) {
            pmm *= -fact*somx2;
            fact += 2.0;
        }
    }
    if (l == m)
        return pmm;
    else {                      /* Compute P_{m+1}^m. */
        pmmp1=x*(2*m+1)*pmm;
        if (l == (m+1))
            return pmmp1;
        else {                  /* Compute P_l^m, l > m+1. */
            for (ll=m+2;ll<=l;ll++) {
                pll=(x*(2*ll-1)*pmmp1-(ll+m-1)*pmm)/(ll-m);
                pmm=pmmp1;
                pmmp1=pll;
            }
            return pll;
        }
    }
}


CITED REFERENCES AND FURTHER READING: 

Magnus, W., and Oberhettinger, F. 1949, Formulas and Theorems for the Functions of
Mathematical Physics (New York: Chelsea), pp. 54ff. [1]

Abramowitz, M., and Stegun, I.A. 1964, Handbook of Mathematical Functions, Applied
Mathematics Series, Volume 55 (Washington: National Bureau of Standards; reprinted 1968 by
Dover Publications, New York), Chapter 8. [2]
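Fresnel Integrals


The two Fresnel integrals are defined by

$$C(x) = \int_0^x \cos\left(\frac{\pi}{2}t^2\right)dt, \qquad S(x) = \int_0^x \sin\left(\frac{\pi}{2}t^2\right)dt \tag{6.9.1}$$

The most convenient way of evaluating these functions to arbitrary precision is
to use power series for small x and a continued fraction for large x. The series are

$$C(x) = x - \left(\frac{\pi}{2}\right)^2\frac{x^5}{5\cdot 2!} + \left(\frac{\pi}{2}\right)^4\frac{x^9}{9\cdot 4!} - \cdots$$

$$S(x) = \left(\frac{\pi}{2}\right)\frac{x^3}{3\cdot 1!} - \left(\frac{\pi}{2}\right)^3\frac{x^7}{7\cdot 3!} + \left(\frac{\pi}{2}\right)^5\frac{x^{11}}{11\cdot 5!} - \cdots \tag{6.9.2}$$

There is a complex continued fraction that yields both S(x) and C(x) simultaneously:

$$C(x) + iS(x) = \frac{1+i}{2}\,\mathrm{erf}\,z, \qquad z = \frac{\sqrt{\pi}}{2}(1-i)\,x \tag{6.9.3}$$

where

$$e^{z^2}\,\mathrm{erfc}\,z = \frac{1}{\sqrt{\pi}}\left(\frac{1}{z+}\;\frac{1/2}{z+}\;\frac{1}{z+}\;\frac{3/2}{z+}\;\frac{2}{z+}\cdots\right) = \frac{2z}{\sqrt{\pi}}\left(\frac{1}{2z^2+1-}\;\frac{1\cdot 2}{2z^2+5-}\;\frac{3\cdot 4}{2z^2+9-}\cdots\right) \tag{6.9.4}$$

In the last line we have converted the "standard" form of the continued fraction to
its "even" form (see §5.2), which converges twice as fast. We must be careful not
to evaluate the alternating series (6.9.2) at too large a value of x; inspection of the
terms shows that x = 1.5 is a good point to switch over to the continued fraction.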
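The two series in (6.9.2) share the common factor $(\pi/2)^k x^{2k+1}/k!$, with even k feeding C(x) and odd k feeding S(x), so they can be summed in a single loop. The sketch below does this in double precision; the name `fresnel_series` and the stopping rule are our own choices, and it is only meant for the small-|x| regime discussed above.

```c
#include <math.h>

/* Sketch only: sum the power series (6.9.2) for C(x) and S(x).
   Intended for |x| <= 1.5; both results are odd functions of x. */
void fresnel_series(double x, double *c, double *s)
{
    double fact = 1.5707963267948966 * x * x;   /* (pi/2) x^2 */
    double term = x;                            /* (pi/2)^k x^(2k+1) / k!, k = 0 */
    double sumc = x, sums = 0.0, t;
    int k;

    for (k = 1; k <= 200; k++) {
        term *= fact / k;
        t = term / (2 * k + 1);
        if ((k / 2) % 2) t = -t;                /* signs go + + - - + + ... with k */
        if (k % 2) sums += t; else sumc += t;   /* odd k: S, even k: C */
        if (fabs(t) <= 1.0e-17 * fabs(sumc)) break;
    }
    *c = sumc;
    *s = sums;
}
```

At x = 1 this should reproduce the tabulated values C(1) ≈ 0.7798934 and S(1) ≈ 0.4382591.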

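Note that for large x

$$C(x) \sim \frac{1}{2} + \frac{1}{\pi x}\sin\left(\frac{\pi}{2}x^2\right), \qquad S(x) \sim \frac{1}{2} - \frac{1}{\pi x}\cos\left(\frac{\pi}{2}x^2\right) \tag{6.9.5}$$

Thus the precision of the routine frenel may be limited by the precision of the
library routines for sine and cosine for large x.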
#include <math.h>
#include "complex.h"
#define EPS 6.0e-8
#define MAXIT 100
#define FPMIN 1.0e-30
#define XMIN 1.5
#define PI 3.1415927
#define PIBY2 (PI/2.0)
/* Here EPS is the relative error; MAXIT is the maximum number of iterations allowed;
   FPMIN is a number near the smallest representable floating-point number; XMIN is
   the dividing line between using the series and continued fraction. */
#define TRUE 1
#define ONE Complex(1.0,0.0)

void frenel(float x, float *s, float *c)
/* Computes the Fresnel integrals S(x) and C(x) for all real x. */
{
    void nrerror(char error_text[]);
    int k,n,odd;
    float a,ax,fact,pix2,sign,sum,sumc,sums,term,test;
    fcomplex b,cc,d,h,del,cs;

    ax=fabs(x);
    if (ax < sqrt(FPMIN)) {     /* Special case: avoid failure of convergence test
                                   because of underflow. */
        *s=0.0;
        *c=ax;
    } else if (ax <= XMIN) {    /* Evaluate both series simultaneously. */
        sum=sums=0.0;
        sumc=ax;
        sign=1.0;
        fact=PIBY2*ax*ax;
        odd=TRUE;
        term=ax;
        n=3;
        for (k=1;k<=MAXIT;k++) {
            term *= fact/k;
            sum += sign*term/n;
            test=fabs(sum)*EPS;
            if (odd) {
                sign = -sign;
                sums=sum;
                sum=sumc;
            } else {
                sumc=sum;
                sum=sums;
            }
            if (term < test) break;
            odd=!odd;
            n += 2;
        }
        if (k > MAXIT) nrerror("series failed in frenel");
        *s=sums;
        *c=sumc;
    } else {                    /* Evaluate continued fraction by modified Lentz's
                                   method (§5.2). */
        pix2=PI*ax*ax;
        b=Complex(1.0,-pix2);
        cc=Complex(1.0/FPMIN,0.0);
        d=h=Cdiv(ONE,b);
        n = -1;
        for (k=2;k<=MAXIT;k++) {
            n += 2;
            a = -n*(n+1);
            b=Cadd(b,Complex(4.0,0.0));
            d=Cdiv(ONE,Cadd(RCmul(a,d),b));     /* Denominators cannot be zero. */
            cc=Cadd(b,Cdiv(Complex(a,0.0),cc));
            del=Cmul(cc,d);
            h=Cmul(h,del);
            if (fabs(del.r-1.0)+fabs(del.i) < EPS) break;
        }
        if (k > MAXIT) nrerror("cf failed in frenel");
        h=Cmul(Complex(ax,-ax),h);
        cs=Cmul(Complex(0.5,0.5),
            Csub(ONE,Cmul(Complex(cos(0.5*pix2),sin(0.5*pix2)),h)));
        *c=cs.r;
        *s=cs.i;
    }
    if (x < 0.0) {              /* Use antisymmetry. */
        *c = -(*c);
        *s = -(*s);
    }
}


Cosine and Sine Integrals


The cosine and sine integrals are defined by

$$\mathrm{Ci}(x) = \gamma + \ln x + \int_0^x \frac{\cos t - 1}{t}\,dt, \qquad \mathrm{Si}(x) = \int_0^x \frac{\sin t}{t}\,dt \tag{6.9.6}$$

Here $\gamma \approx 0.5772\ldots$ is Euler's constant. We only need a way to calculate the
functions for x > 0, because

$$\mathrm{Si}(-x) = -\mathrm{Si}(x), \qquad \mathrm{Ci}(-x) = \mathrm{Ci}(x) - i\pi \tag{6.9.7}$$

Once again we can evaluate these functions by a judicious combination of power
series and complex continued fraction. The series are

$$\mathrm{Si}(x) = x - \frac{x^3}{3\cdot 3!} + \frac{x^5}{5\cdot 5!} - \cdots$$

$$\mathrm{Ci}(x) = \gamma + \ln x + \left(-\frac{x^2}{2\cdot 2!} + \frac{x^4}{4\cdot 4!} - \cdots\right) \tag{6.9.8}$$

The continued fraction for the exponential integral $E_1(ix)$ is

$$E_1(ix) = -\mathrm{Ci}(x) + i\,[\mathrm{Si}(x) - \pi/2] = e^{-ix}\left(\frac{1}{ix+}\;\frac{1}{1+}\;\frac{1}{ix+}\;\frac{2}{1+}\;\frac{2}{ix+}\cdots\right) = e^{-ix}\left(\frac{1}{1+ix-}\;\frac{1^2}{3+ix-}\;\frac{2^2}{5+ix-}\cdots\right) \tag{6.9.9}$$

The "even" form of the continued fraction is given in the last line and converges
twice as fast for about the same amount of computation. A good crossover point
from the alternating series to the continued fraction is x = 2 in this case. As for
the Fresnel integrals, for large x the precision may be limited by the precision of
the sine and cosine routines.
#include <math.h>
#include "complex.h"
#define EPS 6.0e-8              /* Relative error, or absolute error near a zero of Ci(x). */
#define EULER 0.57721566        /* Euler's constant gamma. */
#define MAXIT 100               /* Maximum number of iterations allowed. */
#define PIBY2 1.5707963         /* pi/2. */
#define FPMIN 1.0e-30           /* Close to smallest representable floating-point number. */
#define TMIN 2.0                /* Dividing line between using the series and continued
                                   fraction. */
#define TRUE 1
#define ONE Complex(1.0,0.0)

void cisi(float x, float *ci, float *si)
/* Computes the cosine and sine integrals Ci(x) and Si(x). Ci(0) is returned as a large
   negative number and no error message is generated. For x < 0 the routine returns
   Ci(-x) and you must supply the -i*pi yourself. */
{
    void nrerror(char error_text[]);
    int i,k,odd;
    float a,err,fact,sign,sum,sumc,sums,t,term;
    fcomplex h,b,c,d,del;

    t=fabs(x);
    if (t == 0.0) {             /* Special case. */
        *si=0.0;
        *ci = -1.0/FPMIN;
        return;
    }
    if (t > TMIN) {             /* Evaluate continued fraction by modified Lentz's
                                   method (§5.2). */
        b=Complex(1.0,t);
        c=Complex(1.0/FPMIN,0.0);
        d=h=Cdiv(ONE,b);
        for (i=2;i<=MAXIT;i++) {
            a = -(i-1)*(i-1);
            b=Cadd(b,Complex(2.0,0.0));
            d=Cdiv(ONE,Cadd(RCmul(a,d),b));     /* Denominators cannot be zero. */
            c=Cadd(b,Cdiv(Complex(a,0.0),c));
            del=Cmul(c,d);
            h=Cmul(h,del);
            if (fabs(del.r-1.0)+fabs(del.i) < EPS) break;
        }
        if (i > MAXIT) nrerror("cf failed in cisi");
        h=Cmul(Complex(cos(t),-sin(t)),h);
        *ci = -h.r;
        *si=PIBY2+h.i;
    } else {                    /* Evaluate both series simultaneously. */
        if (t < sqrt(FPMIN)) {  /* Special case: avoid failure of convergence test
                                   because of underflow. */
            sumc=0.0;
            sums=t;
        } else {
            sum=sums=sumc=0.0;
            sign=fact=1.0;
            odd=TRUE;
            for (k=1;k<=MAXIT;k++) {
                fact *= t/k;
                term=fact/k;
                sum += sign*term;
                err=term/fabs(sum);
                if (odd) {
                    sign = -sign;
                    sums=sum;
                    sum=sumc;
                } else {
                    sumc=sum;
                    sum=sums;
                }
                if (err < EPS) break;
                odd=!odd;
            }
            if (k > MAXIT) nrerror("maxits exceeded in cisi");
        }
        *si=sums;
        *ci=sumc+log(t)+EULER;
    }
    if (x < 0.0) *si = -(*si);
}


CITED REFERENCES AND FURTHER READING: 

Stegun, I.A., and Zucker, R. 1976, Journal of Research of the National Bureau of Standards, 
vol. 80B, pp. 291-311; 1981, op. cit., vol. 86, pp. 661-686. 

Abramowitz, M., and Stegun, I.A. 1964, Handbook of Mathematical Functions, Applied
Mathematics Series, Volume 55 (Washington: National Bureau of Standards; reprinted 1968 by
Dover Publications, New York), Chapters 5 and 7.


6.10 Dawson's Integral


Dawson's Integral F(x) is defined by

$$F(x) = e^{-x^2}\int_0^x e^{t^2}\,dt \tag{6.10.1}$$

The function can also be related to the complex error function by

$$F(z) = \frac{i\sqrt{\pi}}{2}\,e^{-z^2}\,[1 - \mathrm{erfc}(-iz)] \tag{6.10.2}$$

A remarkable approximation for F(z), due to Rybicki [1], is

$$F(z) = \lim_{h\to 0}\;\frac{1}{\sqrt{\pi}}\sum_{n\ \mathrm{odd}}\frac{e^{-(z-nh)^2}}{n} \tag{6.10.3}$$

What makes equation (6.10.3) unusual is that its accuracy increases exponentially
as h gets small, so that quite moderate values of h (and correspondingly quite rapid
convergence of the series) give very accurate approximations.

We will discuss the theory that leads to equation (6.10.3) later, in §13.11, as
an interesting application of Fourier methods. Here we simply implement a routine
for real values of x based on the formula.

It is first convenient to shift the summation index to center it approximately on
the maximum of the exponential term. Define $n_0$ to be the even integer nearest to
x/h, and $x_0 \equiv n_0 h$, $x' \equiv x - x_0$, and $n' \equiv n - n_0$, so that

$$F(x) \approx \frac{1}{\sqrt{\pi}}\sum_{\substack{n'=-N \\ n'\ \mathrm{odd}}}^{N}\frac{e^{-(x'-n'h)^2}}{n'+n_0} \tag{6.10.4}$$

where the approximate equality is accurate when h is sufficiently small and N is
sufficiently large. The computation of this formula can be greatly speeded up if
we note that

$$e^{-(x'-n'h)^2} = e^{-x'^2}\,e^{-(n'h)^2}\left(e^{2x'h}\right)^{n'} \tag{6.10.5}$$

The first factor is computed once, the second is an array of constants to be stored,
and the third can be computed recursively, so that only two exponentials need be
evaluated. Advantage is also taken of the symmetry of the coefficients $e^{-(n'h)^2}$ by
breaking the summation up into positive and negative values of n' separately.

In the following routine, the choices h = 0.4 and N = 11 are made. Because
of the symmetry of the summations and the restriction to odd values of n, the limits
on the for loops are 1 to 6. The accuracy of the result in this float version is about
$2 \times 10^{-7}$. In order to maintain relative accuracy near x = 0, where F(x) vanishes,
the program branches to the evaluation of the power series [2] for F(x), for |x| < 0.2.


#include <math.h>
#include "nrutil.h"
#define NMAX 6
#define H 0.4
#define A1 (2.0/3.0)
#define A2 0.4
#define A3 (2.0/7.0)

float dawson(float x)
/* Returns Dawson's integral F(x) = exp(-x^2) * integral from 0 to x of exp(t^2) dt
   for any real x. */
{
    int i,n0;
    float d1,d2,e1,e2,sum,x2,xp,xx,ans;
    static float c[NMAX+1];
    static int init = 0;        /* Flag is 0 if we need to initialize, else 1. */

    if (init == 0) {
        init=1;
        for (i=1;i<=NMAX;i++) c[i]=exp(-SQR((2.0*i-1.0)*H));
    }
    if (fabs(x) < 0.2) {        /* Use series expansion. */
        x2=x*x;
        ans=x*(1.0-A1*x2*(1.0-A2*x2*(1.0-A3*x2)));
    } else {                    /* Use sampling theorem representation. */
        xx=fabs(x);
        n0=2*(int)(0.5*xx/H+0.5);
        xp=xx-n0*H;
        e1=exp(2.0*xp*H);
        e2=e1*e1;
        d1=n0+1;
        d2=d1-2.0;
        sum=0.0;
        for (i=1;i<=NMAX;i++,d1+=2.0,d2-=2.0,e1*=e2)
            sum += c[i]*(e1/d1+1.0/(d2*e1));
        ans=0.5641895835*SIGN(exp(-xp*xp),x)*sum;   /* Constant is 1/sqrt(pi). */
    }
    return ans;
}
Other methods for computing Dawson's integral are also known [2,3].


CITED REFERENCES AND FURTHER READING: 

Rybicki, G.B. 1989, Computers in Physics, vol. 3, no. 2, pp. 85-87. [1] 

Cody, W.J., Pociorek, K.A., and Thatcher, H.C. 1970, Mathematics of Computation, vol. 24, 
pp. 171-178. [2] 

McCabe, J.H. 1974, Mathematics of Computation, vol. 28, pp. 811-816. [3] 


6.11 Elliptic Integrals and Jacobian Elliptic Functions


Elliptic integrals occur in many applications, because any integral of the form

$$\int R(t,s)\,dt \tag{6.11.1}$$

where R is a rational function of t and s, and s is the square root of a cubic or
quartic polynomial in t, can be evaluated in terms of elliptic integrals. Standard
references [1] describe how to carry out the reduction, which was originally done
by Legendre. Legendre showed that only three basic elliptic integrals are required.
The simplest of these is

$$I_1 = \int_y^x \frac{dt}{\sqrt{(a_1+b_1t)(a_2+b_2t)(a_3+b_3t)(a_4+b_4t)}} \tag{6.11.2}$$

where we have written the quartic $s^2$ in factored form. In standard integral tables [2],
one of the limits of integration is always a zero of the quartic, while the other limit
lies closer than the next zero, so that there is no singularity within the interval. To
evaluate $I_1$, we simply break the interval [y, x] into subintervals, each of which either
begins or ends on a singularity. The tables, therefore, need only distinguish the eight
cases in which each of the four zeros (ordered according to size) appears as the upper
or lower limit of integration. In addition, when one of the b's in (6.11.2) tends to
zero, the quartic reduces to a cubic, with the largest or smallest singularity moving
to $\pm\infty$; this leads to eight more cases (actually just special cases of the first eight).
The sixteen cases in total are then usually tabulated in terms of Legendre's standard
elliptic integral of the 1st kind, which we will define below. By a change of the
variable of integration t, the zeros of the quartic are mapped to standard locations
on the real axis. Then only two dimensionless parameters are needed to tabulate
Legendre's integral. However, the symmetry of the original integral (6.11.2) under
permutation of the roots is concealed in Legendre's notation. We will get back to
Legendre's notation below. But first, here is a better way:


Carlson [3] has given a new definition of a standard elliptic integral of the first kind,

$$R_F(x,y,z) = \frac{1}{2}\int_0^\infty \frac{dt}{\sqrt{(t+x)(t+y)(t+z)}} \tag{6.11.3}$$

where x, y, and z are nonnegative and at most one is zero. By standardizing the range of
integration, he retains permutation symmetry for the zeros. (Weierstrass' canonical form
also has this property.) Carlson first shows that when x or y is a zero of the quartic in
(6.11.2), the integral $I_1$ can be written in terms of $R_F$ in a form that is symmetric under
permutation of the remaining three zeros. In the general case when neither x nor y is a
zero, two such $R_F$ functions can be combined into a single one by an addition theorem,
leading to the fundamental formula

$$I_1 = 2R_F(U_{12}^2,\, U_{13}^2,\, U_{14}^2) \tag{6.11.4}$$

where

$$U_{ij} = (X_iX_jY_kY_m + Y_iY_jX_kX_m)/(x-y) \tag{6.11.5}$$

$$X_i = (a_i + b_ix)^{1/2}, \qquad Y_i = (a_i + b_iy)^{1/2} \tag{6.11.6}$$

and i, j, k, m is any permutation of 1, 2, 3, 4. A short-cut in evaluating these expressions is

$$U_{13}^2 = U_{12}^2 - (a_1b_4 - a_4b_1)(a_2b_3 - a_3b_2)$$

$$U_{14}^2 = U_{12}^2 - (a_1b_3 - a_3b_1)(a_2b_4 - a_4b_2) \tag{6.11.7}$$

The U's correspond to the three ways of pairing the four zeros, and $I_1$ is thus manifestly
symmetric under permutation of the zeros. Equation (6.11.4) therefore reproduces all sixteen
cases when one limit is a zero, and also includes the cases when neither limit is a zero.

Thus Carlson's function allows arbitrary ranges of integration and arbitrary positions of
the branch points of the integrand relative to the interval of integration. To handle elliptic
integrals of the second and third kind, Carlson defines the standard integral of the third kind as

$$R_J(x,y,z,p) = \frac{3}{2}\int_0^\infty \frac{dt}{(t+p)\sqrt{(t+x)(t+y)(t+z)}} \tag{6.11.8}$$

which is symmetric in x, y, and z. The degenerate case when two arguments are equal
is denoted

$$R_D(x,y,z) = R_J(x,y,z,z) \tag{6.11.9}$$

and is symmetric in x and y. The function $R_D$ replaces Legendre's integral of the second
kind. The degenerate form of $R_F$ is denoted

$$R_C(x,y) = R_F(x,y,y) \tag{6.11.10}$$

It embraces logarithmic, inverse circular, and inverse hyperbolic functions.

Carlson [4-7] gives integral tables in terms of the exponents of the linear factors of
the quartic in (6.11.1). For example, the integral where the exponents are
$(\tfrac{1}{2}, \tfrac{1}{2}, -\tfrac{1}{2}, -\tfrac{3}{2})$ can be expressed as a single integral in terms of $R_D$; it accounts
for 144 separate cases in Gradshteyn and Ryzhik [2]!

Refer to Carlson's papers [3-7] for some of the practical details in reducing elliptic
integrals to his standard forms, such as handling complex conjugate zeros.


Turn now to the numerical evaluation of elliptic integrals. The traditional methods [8]
are Gauss or Landen transformations. Descending transformations decrease the modulus
k of the Legendre integrals towards zero, increasing transformations increase it towards
unity. In these limits the functions have simple analytic expressions. While these methods
converge quadratically and are quite satisfactory for integrals of the first and second kinds,
they generally lead to loss of significant figures in certain regimes for integrals of the third
kind. Carlson's algorithms [9,10], by contrast, provide a unified method for all three kinds
with no significant cancellations.

The key ingredient in these algorithms is the duplication theorem:

$$R_F(x,y,z) = 2R_F(x+\lambda,\, y+\lambda,\, z+\lambda) = R_F\left(\frac{x+\lambda}{4},\, \frac{y+\lambda}{4},\, \frac{z+\lambda}{4}\right) \tag{6.11.11}$$

where

$$\lambda = (xy)^{1/2} + (xz)^{1/2} + (yz)^{1/2} \tag{6.11.12}$$

This theorem can be proved by a simple change of variable of integration [11]. Equation
(6.11.11) is iterated until the arguments of $R_F$ are nearly equal. For equal arguments we have

$$R_F(x,x,x) = x^{-1/2} \tag{6.11.13}$$

When the arguments are close enough, the function is evaluated from a fixed Taylor expansion
about (6.11.13) through fifth-order terms. While the iterative part of the algorithm is only
linearly convergent, the error ultimately decreases by a factor of $4^6 = 4096$ for each iteration.
Typically only two or three iterations are required, perhaps six or seven if the initial values
of the arguments have huge ratios. We list the algorithm for $R_F$ here, and refer you to
Carlson's paper [9] for the other cases.


Stage 1: For n = 0, 1, 2, ... compute

$$\mu_n = (x_n + y_n + z_n)/3$$

$$X_n = 1 - (x_n/\mu_n), \qquad Y_n = 1 - (y_n/\mu_n), \qquad Z_n = 1 - (z_n/\mu_n)$$

$$\epsilon_n = \max(|X_n|, |Y_n|, |Z_n|)$$

If $\epsilon_n <$ tol go to Stage 2; else compute

$$\lambda_n = (x_ny_n)^{1/2} + (x_nz_n)^{1/2} + (y_nz_n)^{1/2}$$

$$x_{n+1} = (x_n + \lambda_n)/4, \qquad y_{n+1} = (y_n + \lambda_n)/4, \qquad z_{n+1} = (z_n + \lambda_n)/4$$

and repeat this stage.


Stage 2: Compute

$$E_2 = X_nY_n - Z_n^2, \qquad E_3 = X_nY_nZ_n$$

$$R_F = \left(1 - \tfrac{1}{10}E_2 + \tfrac{1}{14}E_3 + \tfrac{1}{24}E_2^2 - \tfrac{3}{44}E_2E_3\right)/(\mu_n)^{1/2}$$


In some applications the argument p in $R_J$ or the argument y in $R_C$ is negative, and the
Cauchy principal value of the integral is required. This is easily handled by using the formulas

$$R_J(x,y,z,p) = \left[(\gamma - y)\,R_J(x,y,z,\gamma) - 3R_F(x,y,z) + 3R_C(xz/y,\, p\gamma/y)\right]/(y-p) \tag{6.11.14}$$

where

$$\gamma \equiv y + \frac{(z-y)(y-x)}{y-p} \tag{6.11.15}$$

is positive if p is negative, and

$$R_C(x,y) = \left(\frac{x}{x-y}\right)^{1/2} R_C(x-y,\, -y) \tag{6.11.16}$$

The Cauchy principal value of $R_J$ has a zero at some value of p < 0, so (6.11.14) will give
some loss of significant figures near the zero.
#include <math.h>
#include "nrutil.h"
#define ERRTOL 0.08
#define TINY 1.5e-38
#define BIG 3.0e37
#define THIRD (1.0/3.0)
#define C1 (1.0/24.0)
#define C2 0.1
#define C3 (3.0/44.0)
#define C4 (1.0/14.0)

float rf(float x, float y, float z)
/* Computes Carlson's elliptic integral of the first kind, R_F(x,y,z). x, y, and z must
   be nonnegative, and at most one can be zero. TINY must be at least 5 times the
   machine underflow limit, BIG at most one fifth the machine overflow limit. */
{
    float alamb,ave,delx,dely,delz,e2,e3,sqrtx,sqrty,sqrtz,xt,yt,zt;

    if (FMIN(FMIN(x,y),z) < 0.0 || FMIN(FMIN(x+y,x+z),y+z) < TINY ||
        FMAX(FMAX(x,y),z) > BIG)
            nrerror("invalid arguments in rf");
    xt=x;
    yt=y;
    zt=z;
    do {
        sqrtx=sqrt(xt);
        sqrty=sqrt(yt);
        sqrtz=sqrt(zt);
        alamb=sqrtx*(sqrty+sqrtz)+sqrty*sqrtz;
        xt=0.25*(xt+alamb);
        yt=0.25*(yt+alamb);
        zt=0.25*(zt+alamb);
        ave=THIRD*(xt+yt+zt);
        delx=(ave-xt)/ave;
        dely=(ave-yt)/ave;
        delz=(ave-zt)/ave;
    } while (FMAX(FMAX(fabs(delx),fabs(dely)),fabs(delz)) > ERRTOL);
    e2=delx*dely-delz*delz;
    e3=delx*dely*delz;
    return (1.0+(C1*e2-C2-C3*e3)*e2+C4*e3)/sqrt(ave);
}


A value of 0.08 for the error tolerance parameter is adequate for single precision (7
significant digits). Since the error scales as $\epsilon_n^6$, we see that 0.0025 will yield double precision
(16 significant digits) and require at most two or three more iterations. Since the coefficients
of the sixth-order truncation error are different for the other elliptic functions, these values for
the error tolerance should be changed to 0.04 and 0.0012 in the algorithm for $R_C$, and 0.05 and
0.0015 for $R_J$ and $R_D$. As well as being an algorithm in its own right for certain combinations
of elementary functions, the algorithm for $R_C$ is used repeatedly in the computation of $R_J$.

The C implementations test the input arguments against two machine-dependent constants,
TINY and BIG, to ensure that there will be no underflow or overflow during the
computation. We have chosen conservative values, corresponding to a machine minimum of
$3 \times 10^{-39}$ and a machine maximum of $1.7 \times 10^{38}$. You can always extend the range of
admissible argument values by using the homogeneity relations (6.11.22), below.



#include <math.h>
#include "nrutil.h"
#define ERRTOL 0.05
#define TINY 1.0e-25
#define BIG 4.5e21
#define C1 (3.0/14.0)
#define C2 (1.0/6.0)
#define C3 (9.0/22.0)
#define C4 (3.0/26.0)
#define C5 (0.25*C3)
#define C6 (1.5*C4)

float rd(float x, float y, float z)
/* Computes Carlson's elliptic integral of the second kind, R_D(x,y,z). x and y must be
   nonnegative, and at most one can be zero. z must be positive. TINY must be at least
   twice the negative 2/3 power of the machine overflow limit. BIG must be at most
   0.1 x ERRTOL times the negative 2/3 power of the machine underflow limit. */
{
    float alamb,ave,delx,dely,delz,ea,eb,ec,ed,ee,fac,sqrtx,sqrty,
        sqrtz,sum,xt,yt,zt;

    if (FMIN(x,y) < 0.0 || FMIN(x+y,z) < TINY || FMAX(FMAX(x,y),z) > BIG)
        nrerror("invalid arguments in rd");
    xt=x;
    yt=y;
    zt=z;
    sum=0.0;
    fac=1.0;
    do {
        sqrtx=sqrt(xt);
        sqrty=sqrt(yt);
        sqrtz=sqrt(zt);
        alamb=sqrtx*(sqrty+sqrtz)+sqrty*sqrtz;
        sum += fac/(sqrtz*(zt+alamb));
        fac=0.25*fac;
        xt=0.25*(xt+alamb);
        yt=0.25*(yt+alamb);
        zt=0.25*(zt+alamb);
        ave=0.2*(xt+yt+3.0*zt);
        delx=(ave-xt)/ave;
        dely=(ave-yt)/ave;
        delz=(ave-zt)/ave;
    } while (FMAX(FMAX(fabs(delx),fabs(dely)),fabs(delz)) > ERRTOL);
    ea=delx*dely;
    eb=delz*delz;
    ec=ea-eb;
    ed=ea-6.0*eb;
    ee=ed+ec+ec;
    return 3.0*sum+fac*(1.0+ed*(-C1+C5*ed-C6*delz*ee)
        +delz*(C2*ee+delz*(-C3*ec+delz*C4*ea)))/(ave*sqrt(ave));
}


#include <math.h>
#include "nrutil.h"
#define ERRTOL 0.05
#define TINY 2.5e-13
#define BIG 9.0e11
#define C1 (3.0/14.0)
#define C2 (1.0/3.0)
#define C3 (3.0/22.0)
#define C4 (3.0/26.0)
#define C5 (0.75*C3)
#define C6 (1.5*C4)
#define C7 (0.5*C2)
#define C8 (C3+C3)

float rj(float x, float y, float z, float p)
/* Computes Carlson's elliptic integral of the third kind, R_J(x,y,z,p). x, y, and z must be
nonnegative, and at most one can be zero. p must be nonzero. If p < 0, the Cauchy principal
value is returned. TINY must be at least twice the cube root of the machine underflow limit,
BIG at most one fifth the cube root of the machine overflow limit. */
{
    float rc(float x, float y);
    float rf(float x, float y, float z);
    float a,alamb,alpha,ans,ave,b,beta,delp,delx,dely,delz,ea,eb,ec,
        ed,ee,fac,pt,rcx,rho,sqrtx,sqrty,sqrtz,sum,tau,xt,yt,zt;

    if (FMIN(FMIN(x,y),z) < 0.0 || FMIN(FMIN(FMIN(x+y,x+z),y+z),fabs(p)) < TINY
        || FMAX(FMAX(FMAX(x,y),z),fabs(p)) > BIG)
            nrerror("invalid arguments in rj");
    sum=0.0;
    fac=1.0;
    if (p > 0.0) {
        xt=x;
        yt=y;
        zt=z;
        pt=p;
    } else {
        xt=FMIN(FMIN(x,y),z);
        zt=FMAX(FMAX(x,y),z);
        yt=x+y+z-xt-zt;
        a=1.0/(yt-p);
        b=a*(zt-yt)*(yt-xt);
        pt=yt+b;
        rho=xt*zt/yt;
        tau=p*pt/yt;
        rcx=rc(rho,tau);
    }
    do {
        sqrtx=sqrt(xt);
        sqrty=sqrt(yt);
        sqrtz=sqrt(zt);
        alamb=sqrtx*(sqrty+sqrtz)+sqrty*sqrtz;
        alpha=SQR(pt*(sqrtx+sqrty+sqrtz)+sqrtx*sqrty*sqrtz);
        beta=pt*SQR(pt+alamb);
        sum += fac*rc(alpha,beta);
        fac=0.25*fac;
        xt=0.25*(xt+alamb);
        yt=0.25*(yt+alamb);
        zt=0.25*(zt+alamb);
        pt=0.25*(pt+alamb);
        ave=0.2*(xt+yt+zt+pt+pt);
        delx=(ave-xt)/ave;
        dely=(ave-yt)/ave;
        delz=(ave-zt)/ave;
        delp=(ave-pt)/ave;
    } while (FMAX(FMAX(FMAX(fabs(delx),fabs(dely)),
        fabs(delz)),fabs(delp)) > ERRTOL);
    ea=delx*(dely+delz)+dely*delz;
    eb=delx*dely*delz;
    ec=delp*delp;
    ed=ea-3.0*ec;
    ee=eb+2.0*delp*(ea-ec);
    ans=3.0*sum+fac*(1.0+ed*(-C1+C5*ed-C6*ee)+eb*(C7+delp*(-C8+delp*C4))
        +delp*ea*(C2-delp*C3)-C2*delp*ec)/(ave*sqrt(ave));
    if (p <= 0.0) ans=a*(b*ans+3.0*(rcx-rf(xt,yt,zt)));
    return ans;
}

#include <math.h>
#include "nrutil.h"
#define ERRTOL 0.04
#define TINY 1.69e-38
#define SQRTNY 1.3e-19
#define BIG 3.e37
#define TNBG (TINY*BIG)
#define COMP1 (2.236/SQRTNY)
#define COMP2 (TNBG*TNBG/25.0)
#define THIRD (1.0/3.0)
#define C1 0.3
#define C2 (1.0/7.0)
#define C3 0.375
#define C4 (9.0/22.0)

float rc(float x, float y)
/* Computes Carlson's degenerate elliptic integral, R_C(x,y). x must be nonnegative and y must
be nonzero. If y < 0, the Cauchy principal value is returned. TINY must be at least 5 times
the machine underflow limit, BIG at most one fifth the machine maximum overflow limit. */
{
    float alamb,ave,s,w,xt,yt;

    if (x < 0.0 || y == 0.0 || (x+fabs(y)) < TINY || (x+fabs(y)) > BIG ||
        (y<-COMP1 && x > 0.0 && x < COMP2))
            nrerror("invalid arguments in rc");
    if (y > 0.0) {
        xt=x;
        yt=y;
        w=1.0;
    } else {
        xt=x-y;
        yt = -y;
        w=sqrt(x)/sqrt(xt);
    }
    do {
        alamb=2.0*sqrt(xt)*sqrt(yt)+yt;
        xt=0.25*(xt+alamb);
        yt=0.25*(yt+alamb);
        ave=THIRD*(xt+yt+yt);
        s=(yt-ave)/ave;
    } while (fabs(s) > ERRTOL);
    return w*(1.0+s*s*(C1+s*(C2+s*(C3+s*C4))))/sqrt(ave);
}


At times you may want to express your answer in Legendre’s notation. Alternatively,
you may be given results in that notation and need to compute their values
with the programs given above. It is a simple matter to transform back and forth.
The Legendre elliptic integral of the 1st kind is defined as

$$F(\phi, k) \equiv \int_0^{\phi} \frac{d\theta}{\sqrt{1 - k^2 \sin^2\theta}} \qquad (6.11.17)$$

The complete elliptic integral of the 1st kind is given by

$$K(k) \equiv F(\pi/2, k) \qquad (6.11.18)$$

In terms of $R_F$,

$$F(\phi, k) = \sin\phi \, R_F(\cos^2\phi,\ 1 - k^2 \sin^2\phi,\ 1)$$
$$K(k) = R_F(0,\ 1 - k^2,\ 1) \qquad (6.11.19)$$
The Legendre elliptic integral of the 2nd kind and the complete elliptic integral of
the 2nd kind are given by

$$E(\phi, k) \equiv \int_0^{\phi} \sqrt{1 - k^2 \sin^2\theta}\, d\theta$$
$$= \sin\phi \, R_F(\cos^2\phi,\ 1 - k^2 \sin^2\phi,\ 1) - \tfrac{1}{3} k^2 \sin^3\phi \, R_D(\cos^2\phi,\ 1 - k^2 \sin^2\phi,\ 1)$$
$$E(k) \equiv E(\pi/2, k) = R_F(0,\ 1 - k^2,\ 1) - \tfrac{1}{3} k^2 R_D(0,\ 1 - k^2,\ 1) \qquad (6.11.20)$$

Finally, the Legendre elliptic integral of the 3rd kind is

$$\Pi(\phi, n, k) \equiv \int_0^{\phi} \frac{d\theta}{(1 + n \sin^2\theta)\sqrt{1 - k^2 \sin^2\theta}}$$
$$= \sin\phi \, R_F(\cos^2\phi,\ 1 - k^2 \sin^2\phi,\ 1) - \tfrac{1}{3} n \sin^3\phi \, R_J(\cos^2\phi,\ 1 - k^2 \sin^2\phi,\ 1,\ 1 + n \sin^2\phi) \qquad (6.11.21)$$

(Note that this sign convention for n is opposite that of Abramowitz and Stegun [12],
and that their $\sin\alpha$ is our k.)

#include <math.h>
#include "nrutil.h"

float ellf(float phi, float ak)
/* Legendre elliptic integral of the 1st kind F(phi,k), evaluated using Carlson's function R_F. The
argument ranges are 0 <= phi <= pi/2, 0 <= k sin(phi) <= 1. */
{
    float rf(float x, float y, float z);
    float s;

    s=sin(phi);
    return s*rf(SQR(cos(phi)),(1.0-s*ak)*(1.0+s*ak),1.0);
}


#include <math.h>
#include "nrutil.h"

float elle(float phi, float ak)
/* Legendre elliptic integral of the 2nd kind E(phi,k), evaluated using Carlson's functions R_D and
R_F. The argument ranges are 0 <= phi <= pi/2, 0 <= k sin(phi) <= 1. */
{
    float rd(float x, float y, float z);
    float rf(float x, float y, float z);
    float cc,q,s;

    s=sin(phi);
    cc=SQR(cos(phi));
    q=(1.0-s*ak)*(1.0+s*ak);
    return s*(rf(cc,q,1.0)-(SQR(s*ak))*rd(cc,q,1.0)/3.0);
}

#include <math.h>
#include "nrutil.h"

float ellpi(float phi, float en, float ak)
/* Legendre elliptic integral of the 3rd kind Pi(phi,n,k), evaluated using Carlson's functions R_J and
R_F. (Note that the sign convention on n is opposite that of Abramowitz and Stegun.) The
ranges of phi and k are 0 <= phi <= pi/2, 0 <= k sin(phi) <= 1. */
{
    float rf(float x, float y, float z);
    float rj(float x, float y, float z, float p);
    float cc,enss,q,s;

    s=sin(phi);
    enss=en*s*s;
    cc=SQR(cos(phi));
    q=(1.0-s*ak)*(1.0+s*ak);
    return s*(rf(cc,q,1.0)-enss*rj(cc,q,1.0,1.0+enss)/3.0);
}


Carlson’s functions are homogeneous of degree $-\frac{1}{2}$ and $-\frac{3}{2}$, so

$$R_F(\lambda x, \lambda y, \lambda z) = \lambda^{-1/2} R_F(x, y, z)$$
$$R_J(\lambda x, \lambda y, \lambda z, \lambda p) = \lambda^{-3/2} R_J(x, y, z, p) \qquad (6.11.22)$$

Thus to express a Carlson function in Legendre’s notation, permute the first three
arguments into ascending order, use homogeneity to scale the third argument to be
1, and then use equations (6.11.19)-(6.11.21).

Jacobian Elliptic Functions

The Jacobian elliptic function sn is defined as follows: instead of considering
the elliptic integral

$$u(y, k) \equiv u = F(\phi, k) \qquad (6.11.23)$$

consider the inverse function

$$y = \sin\phi = \mathrm{sn}(u, k) \qquad (6.11.24)$$

Equivalently,

$$u = \int_0^{\mathrm{sn}} \frac{dy}{\sqrt{(1 - y^2)(1 - k^2 y^2)}} \qquad (6.11.25)$$

When k = 0, sn is just sin. The functions cn and dn are defined by the relations

$$\mathrm{sn}^2 + \mathrm{cn}^2 = 1, \qquad k^2\,\mathrm{sn}^2 + \mathrm{dn}^2 = 1 \qquad (6.11.26)$$
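
The two relations in (6.11.26) make a convenient sanity check on any computed triple (sn, cn, dn). The helper below is a hypothetical sketch that returns the larger of the two residuals. It can be exercised on the two limiting cases with closed forms: k = 0, where sn = sin u, cn = cos u, dn = 1, and k = 1, where (as a standard result, assumed here) sn = tanh u and cn = dn = 1/cosh u.

```c
#include <math.h>

/* Largest absolute residual of the two identities in (6.11.26)
   for a given triple (sn, cn, dn) and modulus k. */
double jacobi_identity_error(double sn, double cn, double dn, double k)
{
    double r1 = fabs(sn * sn + cn * cn - 1.0);
    double r2 = fabs(k * k * sn * sn + dn * dn - 1.0);
    return r1 > r2 ? r1 : r2;
}
```

Both limiting cases should give a residual at the level of floating-point roundoff.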



The routine given below actually takes $m_c \equiv k_c^2 = 1 - k^2$ as an input parameter.
It also computes all three functions sn, cn, and dn since computing all three is no
harder than computing any one of them. For a description of the method, see [8].

#include <math.h>
#define CA 0.0003    /* The accuracy is the square of CA. */

void sncndn(float uu, float emmc, float *sn, float *cn, float *dn)
/* Returns the Jacobian elliptic functions sn(u,kc), cn(u,kc), and dn(u,kc). Here uu = u, while
emmc = kc^2. */
{
    float a,b,c,d,emc,u;
    float em[14],en[14];
    int i,ii,l,bo;

    emc=emmc;
    u=uu;
    if (emc) {
        bo=(emc < 0.0);
        if (bo) {
            d=1.0-emc;
            emc /= -d;    /* Map m_c to -m_c/(1-m_c); the listing had "emc /= -1.0/d". */
            u *= (d=sqrt(d));
        }
        a=1.0;
        *dn=1.0;
        for (i=1;i<=13;i++) {
            l=i;
            em[i]=a;
            en[i]=(emc=sqrt(emc));
            c=0.5*(a+emc);
            if (fabs(a-emc) <= CA*a) break;
            emc *= a;
            a=c;
        }
        u *= c;
        *sn=sin(u);
        *cn=cos(u);
        if (*sn) {
            a=(*cn)/(*sn);
            c *= a;
            for (ii=l;ii>=1;ii--) {
                b=em[ii];
                a *= c;
                c *= (*dn);
                *dn=(en[ii]+a)/(b+a);
                a=c/b;
            }
            a=1.0/sqrt(c*c+1.0);
            *sn=(*sn >= 0.0 ? a : -a);
            *cn=c*(*sn);
        }
        if (bo) {
            a=(*dn);
            *dn=(*cn);
            *cn=a;
            *sn /= d;
        }
    } else {
        *cn=1.0/cosh(u);
        *dn=(*cn);
        *sn=tanh(u);
    }
}

6.12 Hypergeometric Functions


As was discussed in §5.14, a fast, general routine for the complex hypergeometric
function $_2F_1(a, b, c; z)$ is difficult or impossible. The function is defined as
the analytic continuation of the hypergeometric series,

$${}_2F_1(a, b, c; z) = 1 + \frac{ab}{c}\frac{z}{1!} + \frac{a(a+1)\,b(b+1)}{c(c+1)}\frac{z^2}{2!} + \cdots + \frac{a(a+1)\cdots(a+j-1)\,b(b+1)\cdots(b+j-1)}{c(c+1)\cdots(c+j-1)}\frac{z^j}{j!} + \cdots \qquad (6.12.1)$$

This series converges only within the unit circle $|z| < 1$ (see [1]), but one’s interest
in the function is not confined to this region.

Section 5.14 discussed the method of evaluating this function by direct path
integration in the complex plane. We here merely list the routines that result.

Implementation of the function hypgeo is straightforward, and is described by
comments in the program. The machinery associated with Chapter 16’s routine for
integrating differential equations, odeint, is only minimally intrusive, and need not
even be completely understood: use of odeint requires one zeroed global variable,
one function call, and a prescribed format for the derivative routine hypdrv.

The function hypgeo will fail, of course, for values of z too close to the
singularity at 1. (If you need to approach this singularity, or the one at $\infty$, use the
“linear transformation formulas” in §15.3 of [1].) Away from z = 1, and for moderate
values of a, b, c, it is often remarkable how few steps are required to integrate the
equations. A half-dozen is typical.
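
For real arguments inside the unit circle, the series (6.12.1) can be summed term by term. The following is a minimal real-valued sketch of that summation, with the hypothetical name hyp2f1_series; it is restricted to real a, b, c, z and is not a replacement for the complex routines of this section. It can be checked against the elementary identity $_2F_1(1, 1, 2; z) = -\ln(1-z)/z$.

```c
#include <math.h>

/* Sum the series (6.12.1) and its z-derivative for real arguments, |z| < 1.
   Returns 0 on convergence to machine accuracy, -1 after 1000 terms. */
int hyp2f1_series(double a, double b, double c, double z,
                  double *series, double *deriv)
{
    double fac = 1.0;    /* current term of the series      */
    double sum = 1.0;    /* running sum of the series       */
    double dsum = 0.0;   /* running sum of the derivative   */
    for (int n = 1; n <= 1000; n++) {
        fac *= a * b / c;      /* now the n-th term of the derivative series */
        dsum += fac;
        fac *= z / n;          /* now the n-th term of the series itself     */
        double next = sum + fac;
        if (next == sum) {     /* adding the term no longer changes the sum  */
            *series = sum;
            *deriv = dsum;
            return 0;
        }
        sum = next;
        a += 1.0;
        b += 1.0;
        c += 1.0;
    }
    return -1;
}
```

At z = 0.5 with a = b = 1, c = 2, the returned series value should match 2 ln 2 and the derivative should match 4(1 - ln 2).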

#include <math.h>
#include "complex.h"
#include "nrutil.h"
#define EPS 1.0e-6    /* Accuracy parameter. */

fcomplex aa,bb,cc,z0,dz;    /* Communicates with hypdrv. */

int kmax,kount;    /* Used by odeint. */
float *xp,**yp,dxsav;

fcomplex hypgeo(fcomplex a, fcomplex b, fcomplex c, fcomplex z)
/* Complex hypergeometric function 2F1 for complex a, b, c, and z, by direct integration of the
hypergeometric equation in the complex plane. The branch cut is taken to lie along the real
axis, Re z > 1. */
{
    void bsstep(float y[], float dydx[], int nv, float *xx, float htry,
        float eps, float yscal[], float *hdid, float *hnext,
        void (*derivs)(float, float [], float []));
    void hypdrv(float s, float yy[], float dyyds[]);
    void hypser(fcomplex a, fcomplex b, fcomplex c, fcomplex z,
        fcomplex *series, fcomplex *deriv);
    void odeint(float ystart[], int nvar, float x1, float x2,
        float eps, float h1, float hmin, int *nok, int *nbad,
        void (*derivs)(float, float [], float []),
        void (*rkqs)(float [], float [], int, float *, float, float,
        float [], float *, float *, void (*)(float, float [], float [])));
    int nbad,nok;
    fcomplex ans,y[3];
    float *yy;

    kmax=0;
    if (z.r*z.r+z.i*z.i <= 0.25) {    /* Use series... */
        hypser(a,b,c,z,&ans,&y[2]);
        return ans;
    }
    /* ...or pick a starting point for the path integration. */
    else if (z.r < 0.0) z0=Complex(-0.5,0.0);
    else if (z.r <= 1.0) z0=Complex(0.5,0.0);
    else z0=Complex(0.0,z.i >= 0.0 ? 0.5 : -0.5);
    aa=a;    /* Load the global variables to pass parameters "over the head" of odeint to hypdrv. */
    bb=b;
    cc=c;
    dz=Csub(z,z0);
    hypser(aa,bb,cc,z0,&y[1],&y[2]);    /* Get starting function and derivative. */
    yy=vector(1,4);
    yy[1]=y[1].r;
    yy[2]=y[1].i;
    yy[3]=y[2].r;
    yy[4]=y[2].i;
    odeint(yy,4,0.0,1.0,EPS,0.1,0.0001,&nok,&nbad,hypdrv,bsstep);
    /* The arguments to odeint are the vector of dependent variables, its length, the starting
    and ending values of the independent variable, the accuracy parameter, an initial guess for
    stepsize, a minimum stepsize, the (returned) number of good and bad steps taken, and the
    names of the derivative routine and the (here Bulirsch-Stoer) stepping routine. */
    y[1]=Complex(yy[1],yy[2]);
    free_vector(yy,1,4);
    return y[1];
}

#include "complex.h"
#define ONE Complex(1.0,0.0)

void hypser(fcomplex a, fcomplex b, fcomplex c, fcomplex z, fcomplex *series,
    fcomplex *deriv)
/* Returns the hypergeometric series 2F1 and its derivative, iterating to machine accuracy. For
|z| <= 1/2 convergence is quite rapid. */
{
    void nrerror(char error_text[]);
    int n;
    fcomplex aa,bb,cc,fac,temp;

    deriv->r=0.0;
    deriv->i=0.0;
    fac=Complex(1.0,0.0);
    temp=fac;
    aa=a;
    bb=b;
    cc=c;
    for (n=1;n<=1000;n++) {
        fac=Cmul(fac,Cdiv(Cmul(aa,bb),cc));
        deriv->r+=fac.r;
        deriv->i+=fac.i;
        fac=Cmul(fac,RCmul(1.0/n,z));
        *series=Cadd(temp,fac);
        if (series->r == temp.r && series->i == temp.i) return;
        temp= *series;
        aa=Cadd(aa,ONE);
        bb=Cadd(bb,ONE);
        cc=Cadd(cc,ONE);
    }
    nrerror("convergence failure in hypser");
}


#include "complex.h"
#define ONE Complex(1.0,0.0)

extern fcomplex aa,bb,cc,z0,dz;    /* Defined in hypgeo. */

void hypdrv(float s, float yy[], float dyyds[])
/* Computes derivatives for the hypergeometric equation, see text equation (5.14.4). */
{
    fcomplex z,y[3],dyds[3];

    y[1]=Complex(yy[1],yy[2]);
    y[2]=Complex(yy[3],yy[4]);
    z=Cadd(z0,RCmul(s,dz));
    dyds[1]=Cmul(y[2],dz);
    dyds[2]=Cmul(Csub(Cmul(Cmul(aa,bb),y[1]),Cmul(Csub(cc,
        Cmul(Cadd(Cadd(aa,bb),ONE),z)),y[2])),
        Cdiv(dz,Cmul(z,Csub(ONE,z))));
    dyyds[1]=dyds[1].r;
    dyyds[2]=dyds[1].i;
    dyyds[3]=dyds[2].r;
    dyyds[4]=dyds[2].i;
}


CITED REFERENCES AND FURTHER READING: 

Abramowitz, M., and Stegun, I.A. 1964, Handbook of Mathematical Functions, Applied Mathematics
Series, Volume 55 (Washington: National Bureau of Standards; reprinted 1968 by
Dover Publications, New York). [1]


Sample page from NUMERICAL RECIPES IN C: THE ART OF SCIENTIFIC COMPUTING (ISBN 0-521-43108-5)







Chapter 7. Random Numbers 


7.0 Introduction 


It may seem perverse to use a computer, that most precise and deterministic of 
all machines conceived by the human mind, to produce “random” numbers. More 
than perverse, it may seem to be a conceptual impossibility. Any program, after all, 
will produce output that is entirely predictable, hence not truly “random.” 

Nevertheless, practical computer “random number generators” are in common 
use. We will leave it to philosophers of the computer age to resolve the paradox in 
a deep way (see, e.g., Knuth [1] §3.5 for discussion and references). One sometimes 
hears computer-generated sequences termed pseudo-random, while the word random
is reserved for the output of an intrinsically random physical process, like the elapsed 
time between clicks of a Geiger counter placed next to a sample of some radioactive 
element. We will not try to make such fine distinctions. 

A working, though imprecise, definition of randomness in the context of 
computer-generated sequences, is to say that the deterministic program that produces 
a random sequence should be different from, and — in all measurable respects — 
statistically uncorrelated with, the computer program that uses its output. In other 
words, any two different random number generators ought to produce statistically 
the same results when coupled to your particular applications program. If they don’t, 
then at least one of them is not (from your point of view) a good generator. 

The above definition may seem circular, comparing, as it does, one generator to 
another. However, there exists a body of random number generators which mutually 
do satisfy the definition over a very, very broad class of applications programs. 
And it is also found empirically that statistically identical results are obtained from 
random numbers produced by physical processes. So, because such generators are 
known to exist, we can leave to the philosophers the problem of defining them. 

A pragmatic point of view, then, is that randomness is in the eye of the beholder 
(or programmer). What is random enough for one application may not be random 
enough for another. Still, one is not entirely adrift in a sea of incommensurable 
applications programs: There is a certain list of statistical tests, some sensible and 
some merely enshrined by history, which on the whole will do a very good job 
of ferreting out any correlations that are likely to be detected by an applications 
program (in this case, yours). Good random number generators ought to pass all of 
these tests; or at least the user had better be aware of any that they fail, so that he or 
she will be able to judge whether they are relevant to the case at hand. 

As for references on this subject, the one to turn to first is Knuth [1]. Then
try [2]. Only a few of the standard books on numerical methods [3-4] treat topics
relating to random numbers.

CITED REFERENCES AND FURTHER READING: 

Knuth, D.E. 1981, Seminumerical Algorithms, 2nd ed., vol. 2 of The Art of Computer Programming 
(Reading, MA: Addison-Wesley), Chapter 3, especially §3.5. [1] 

Bratley, P., Fox, B.L., and Schrage, E.L. 1983, A Guide to Simulation (New York: Springer- 
Verlag). [2] 

Dahlquist, G., and Bjorck, A. 1974, Numerical Methods (Englewood Cliffs, NJ: Prentice-Hall), 
Chapter 11. [3] 

Forsythe, G.E., Malcolm, M.A., and Moler, C.B. 1977, Computer Methods for Mathematical 
Computations (Englewood Cliffs, NJ: Prentice-Hall), Chapter 10. [4] 


7.1 Uniform Deviates 

Uniform deviates are just random numbers that lie within a specified range 
(typically 0 to 1), with any one number in the range just as likely as any other. They 
are, in other words, what you probably think “random numbers” are. However, 
we want to distinguish uniform deviates from other sorts of random numbers, for 
example numbers drawn from a normal (Gaussian) distribution of specified mean 
and standard deviation. These other sorts of deviates are almost always generated by 
performing appropriate operations on one or more uniform deviates, as we will see 
in subsequent sections. So, a reliable source of random uniform deviates, the subject 
of this section, is an essential building block for any sort of stochastic modeling or 
Monte Carlo computer work. 

System-Supplied Random Number Generators 

Most C implementations have, lurking within, a pair of library routines for 
initializing, and then generating, “random numbers.” In ANSI C, the synopsis is: 

#include <stdlib.h> 

#define RAND_MAX ...

void srand(unsigned seed); 
int rand(void); 

You initialize the random number generator by invoking srand(seed) with
some arbitrary seed. Each initializing value will typically result in a different
random sequence, or at least a different starting point in some one enormously long
sequence. The same initializing value of seed will always return the same random
sequence, however.

You obtain successive random numbers in the sequence by successive calls to 
rand(). That function returns an integer that is typically in the range 0 to the 
largest representable positive value of type int (inclusive). Usually, as in ANSI C, 
this largest value is available as RAND_MAX, but sometimes you have to figure it out 
for yourself. If you want a random float value between 0.0 (inclusive) and 1.0 
(exclusive), you get it by an expression like 

x = rand()/(RAND_MAX+1.0);

Now our first, and perhaps most important, lesson in this chapter is: be very,
very suspicious of a system-supplied rand() that resembles the one just described.
If all scientific papers whose results are in doubt because of bad rand()s were
to disappear from library shelves, there would be a gap on each shelf about as
big as your fist. System-supplied rand()s are almost always linear congruential
generators, which generate a sequence of integers $I_1, I_2, I_3, \ldots$, each between 0 and
$m - 1$ (e.g., RAND_MAX) by the recurrence relation

$$I_{j+1} = a I_j + c \pmod{m} \qquad (7.1.1)$$

Here m is called the modulus, and a and c are positive integers called the multiplier
and the increment respectively. The recurrence (7.1.1) will eventually repeat itself,
with a period that is obviously no greater than m. If m, a, and c are properly chosen,
then the period will be of maximal length, i.e., of length m. In that case, all possible
integers between 0 and $m - 1$ occur at some point, so any initial “seed” choice of $I_0$
is as good as any other: the sequence just takes off from that point.
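
To make the notion of period concrete, the sketch below counts the cycle length of recurrence (7.1.1) for a toy modulus. The function name lcg_period and the parameter sets are illustrative choices, not values from the text. With m = 16, a = 5, c = 3 the sequence visits all 16 residues, while a = 3, c = 0 started from 1 repeats after only 4 steps. The sketch assumes the seed lies on the cycle (true when the map is a bijection, as in both examples) and that a*x + c does not overflow.

```c
/* Count the steps until I_{j+1} = (a*I_j + c) mod m first returns to the seed.
   Gives up and returns m + 1 if the seed is not revisited within m steps. */
unsigned long lcg_period(unsigned long a, unsigned long c,
                         unsigned long m, unsigned long seed)
{
    unsigned long start = seed % m;
    unsigned long x = start;
    unsigned long n = 0;
    do {
        x = (a * x + c) % m;
        n++;
    } while (x != start && n <= m);
    return n;
}
```

For example, lcg_period(5, 3, 16, 0) returns the full period 16, and lcg_period(3, 0, 16, 1) returns 4.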

Although this general framework is powerful enough to provide quite decent
random numbers, its implementation in many, if not most, ANSI C libraries is quite
flawed; quite a number of implementations are in the category “totally botched.”
Blame should be apportioned about equally between the ANSI C committee and
the implementors. The typical problems are these: First, since the ANSI standard
specifies that rand() return a value of type int (which is only a two-byte quantity
on many machines), RAND_MAX is often not very large. The ANSI C standard
requires only that it be at least 32767. This can be disastrous in many circumstances:
for a Monte Carlo integration (§7.6 and §7.8), you might well want to evaluate $10^6$
different points, but actually be evaluating the same 32767 points 30 times each, not
at all the same thing! You should categorically reject any library random number
routine with a two-byte returned value.

Second, the ANSI committee’s published rationale includes the following 
mischievous passage: “The committee decided that an implementation should be 
allowed to provide a rand function which generates the best random sequence 
possible in that implementation, and therefore mandated no standard algorithm. It 
recognized the value, however, of being able to generate the same pseudo-random 
sequence in different implementations, and so it has published an example.... 
[emphasis added]” The “example” is 

unsigned long next=1;

int rand(void) /* NOT RECOMMENDED (see text) */
{
    next = next*1103515245 + 12345;
    return (unsigned int)(next/65536) % 32768;
}

void srand(unsigned int seed)
{
    next=seed;
}

This corresponds to equation (7.1.1) with a = 1103515245, c = 12345, and
$m = 2^{32}$ (since arithmetic done on unsigned long quantities is guaranteed to
return the correct low-order bits). These are not particularly good choices for a and
c (the period is only $2^{30}$), though they are not gross embarrassments by themselves.
The real botches occur when implementors, taking the committee’s statement above
as license, try to “improve” on the published example. For example, one popular
32-bit PC-compatible compiler provides a long generator that uses the above
congruence, but swaps the high-order and low-order 16 bits of the returned value.
Somebody probably thought that this extra flourish added randomness; in fact it ruins
the generator. While these kinds of blunders can, of course, be fixed, there remains
a fundamental flaw in simple linear congruential generators, which we now discuss.

The linear congruential method has the advantage of being very fast, requiring
only a few operations per call, hence its almost universal use. It has the disadvantage
that it is not free of sequential correlation on successive calls. If k random numbers at
a time are used to plot points in k-dimensional space (with each coordinate between
0 and 1), then the points will not tend to “fill up” the k-dimensional space, but rather
will lie on (k - 1)-dimensional “planes.” There will be at most about $m^{1/k}$ such
planes. If the constants m, a, and c are not very carefully chosen, there will be many
fewer than that. If m is as bad as 32768, then the number of planes on which triples
of points lie in three-dimensional space will be no greater than about the cube root
of 32768, or 32. Even if m is close to the machine’s largest representable integer,
e.g., $\sim 2^{32}$, the number of planes on which triples of points lie in three-dimensional
space is usually no greater than about the cube root of $2^{32}$, about 1600. You might
well be focusing attention on a physical process that occurs in a small fraction of the
total volume, so that the discreteness of the planes can be very pronounced.

Even worse, you might be using a generator whose choices of m, a, and c have
been botched. One infamous such routine, RANDU, with a = 65539 and $m = 2^{31}$,
was widespread on IBM mainframe computers for many years, and widely copied
onto other systems [1]. One of us recalls producing a “random” plot with only 11
planes, and being told by his computer center’s programming consultant that he
had misused the random number generator: “We guarantee that each number is
random individually, but we don’t guarantee that more than one of them is random.”
Figure that out.
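
The planar structure of RANDU follows from simple arithmetic: since $a = 2^{16} + 3$, one has $a^2 = 2^{32} + 6 \cdot 2^{16} + 9 \equiv 6a - 9 \pmod{2^{31}}$, so every three successive values satisfy $I_{j+2} - 6 I_{j+1} + 9 I_j \equiv 0 \pmod{2^{31}}$. The sketch below verifies this relation numerically; the function names are hypothetical, and 64-bit intermediates are used only to keep the check free of overflow.

```c
#include <stdint.h>

/* One step of RANDU: I_{j+1} = 65539 * I_j mod 2^31. */
uint32_t randu_next(uint32_t x)
{
    return (uint32_t)(((uint64_t)65539u * x) % 2147483648u);
}

/* Count how many of ntriples successive triples satisfy
   I_{j+2} - 6*I_{j+1} + 9*I_j == 0 (mod 2^31). */
int randu_planar_count(uint32_t seed, int ntriples)
{
    uint32_t x0 = seed, x1 = randu_next(x0), x2 = randu_next(x1);
    int count = 0;
    for (int i = 0; i < ntriples; i++) {
        int64_t t = (int64_t)x2 - 6 * (int64_t)x1 + 9 * (int64_t)x0;
        if (t % 2147483648LL == 0) count++;
        x0 = x1;
        x1 = x2;
        x2 = randu_next(x2);
    }
    return count;
}
```

Every triple passes the check, which is exactly why points built from three successive RANDU values fall on a small family of parallel planes.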

Correlation in k-space is not the only weakness of linear congruential generators.
Such generators often have their low-order (least significant) bits much less random
than their high-order bits. If you want to generate a random integer between 1 and
10, you should always do it using high-order bits, as in

j=1+(int) (10.0*rand()/(RAND_MAX+1.0));

and never by anything resembling

j=1+(rand() % 10);

(which uses lower-order bits). Similarly you should never try to take apart a
“rand()” number into several supposedly random pieces. Instead use separate
calls for every piece.
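
The weakness of the low-order bits is easy to exhibit for a power-of-two modulus. The sketch below iterates the raw 32-bit state of the congruence from the published example (a = 1103515245, c = 12345, arithmetic modulo $2^{32}$) and measures how soon the lowest few bits repeat. The function name is hypothetical, and the test applies to the raw state rather than to the value that the example rand() actually returns, which discards the lowest 16 bits.

```c
#include <stdint.h>

/* Period of the lowest nbits bits of the raw state of
   x -> x*1103515245 + 12345 (mod 2^32), starting from seed. */
unsigned low_bits_period(unsigned nbits, uint32_t seed)
{
    uint32_t mask = (nbits >= 32) ? 0xFFFFFFFFu : ((1u << nbits) - 1u);
    uint32_t x = seed;
    uint32_t start = x & mask;
    unsigned n = 0;
    do {
        x = x * 1103515245u + 12345u;   /* unsigned arithmetic wraps mod 2^32 */
        n++;
    } while ((x & mask) != start);
    return n;
}
```

The lowest bit simply alternates (period 2), and the lowest k bits repeat after only $2^k$ steps, however long the period of the full state may be.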

Portable Random Number Generators

Park and Miller [1] have surveyed a large number of random number generators
that have been used over the last 30 years or more. Along with a good theoretical
review, they present an anecdotal sampling of a number of inadequate generators that
have come into widespread use. The historical record is nothing if not appalling.

There is good evidence, both theoretical and empirical, that the simple multiplicative
congruential algorithm

$$I_{j+1} = a I_j \pmod{m} \qquad (7.1.2)$$

can be as good as any of the more general linear congruential generators that have
$c \neq 0$ (equation 7.1.1), provided that the multiplier a and modulus m are chosen exquisitely
carefully. Park and Miller propose a “Minimal Standard” generator based on the
choices

$$a = 7^5 = 16807 \qquad m = 2^{31} - 1 = 2147483647 \qquad (7.1.3)$$

First proposed by Lewis, Goodman, and Miller in 1969, this generator has in
subsequent years passed all new theoretical tests, and (perhaps more importantly)
has accumulated a large amount of successful use. Park and Miller do not claim that
the generator is “perfect” (we will see below that it is not), but only that it is a good
minimal standard against which other generators should be judged.

It is not possible to implement equations (7.1.2) and (7.1.3) directly in a
high-level language, since the product of a and $m - 1$ exceeds the maximum value
for a 32-bit integer. Assembly language implementation using a 64-bit product
register is straightforward, but not portable from machine to machine. A trick
due to Schrage [2,3] for multiplying two 32-bit integers modulo a 32-bit constant,
without using any intermediates larger than 32 bits (including a sign bit) is therefore
extremely interesting: It allows the Minimal Standard generator to be implemented
in essentially any programming language on essentially any machine.

Schrage’s algorithm is based on an approximate factorization of m,

$$m = aq + r, \quad \text{i.e.,} \quad q = [m/a], \quad r = m \bmod a \qquad (7.1.4)$$

with square brackets denoting integer part. If r is small, specifically r < q, and
$0 < z < m - 1$, it can be shown that both $a(z \bmod q)$ and $r[z/q]$ lie in the range
$0, \ldots, m - 1$, and that

$$az \bmod m = \begin{cases} a(z \bmod q) - r[z/q] & \text{if it is} \ge 0, \\ a(z \bmod q) - r[z/q] + m & \text{otherwise} \end{cases} \qquad (7.1.5)$$

The application of Schrage’s algorithm to the constants (7.1.3) uses the values
q = 127773 and r = 2836.
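
Equation (7.1.5) can be checked against a straightforward 64-bit product. In the sketch below, schrage_step uses only 32-bit intermediates while wide_step uses a 64-bit product; both function names, and the two small drivers, are hypothetical and serve only to compare the two computations. The code assumes a platform on which int is at least 32 bits wide.

```c
#include <stdint.h>

#define IA 16807
#define IM 2147483647
#define IQ 127773   /* [IM/IA]   */
#define IR 2836     /* IM mod IA */

/* One step of (7.1.2) by Schrage's method, equation (7.1.5). */
int32_t schrage_step(int32_t z)
{
    int32_t k = z / IQ;
    int32_t t = IA * (z - k * IQ) - IR * k;   /* a(z mod q) - r[z/q] */
    return (t < 0) ? t + IM : t;
}

/* The same step using a 64-bit product, for comparison only. */
int32_t wide_step(int32_t z)
{
    return (int32_t)(((int64_t)IA * z) % IM);
}

/* Returns 1 if the two methods agree for n successive steps from seed. */
int steps_agree(int32_t seed, int n)
{
    int32_t z = seed;
    for (int i = 0; i < n; i++) {
        if (schrage_step(z) != wide_step(z)) return 0;
        z = schrage_step(z);
    }
    return 1;
}

/* Value of the sequence after n steps from seed. */
int32_t minstd_after(int32_t seed, int n)
{
    int32_t z = seed;
    for (int i = 0; i < n; i++) z = schrage_step(z);
    return z;
}
```

Starting from a seed of 1, the value after 10000 steps should be 1043618065, the check value widely quoted for this generator.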

Here is an implementation of the Minimal Standard generator: 

#define IA 16807
#define IM 2147483647
#define AM (1.0/IM)
#define IQ 127773
#define IR 2836
#define MASK 123459876

float ran0(long *idum)
/* "Minimal" random number generator of Park and Miller. Returns a uniform random deviate
between 0.0 and 1.0. Set or reset idum to any integer value (except the unlikely value MASK)
to initialize the sequence; idum must not be altered between calls for successive deviates in
a sequence. */
{
    long k;
    float ans;

    *idum ^= MASK;                 /* XORing with MASK allows use of zero and other
                                      simple bit patterns for idum. */
    k=(*idum)/IQ;
    *idum=IA*(*idum-k*IQ)-IR*k;    /* Compute idum=(IA*idum) % IM without overflows
                                      by Schrage's method. */
    if (*idum < 0) *idum += IM;
    ans=AM*(*idum);                /* Convert idum to a floating result. */
    *idum ^= MASK;                 /* Unmask before return. */
    return ans;
}


The period of ran0 is $2^{31} - 2 \approx 2.1 \times 10^9$. A peculiarity of generators of
the form (7.1.2) is that the value 0 must never be allowed as the initial seed (it
perpetuates itself), and it never occurs for any nonzero initial seed. Experience
has shown that users always manage to call random number generators with the seed
idum=0. That is why ran0 performs its exclusive-or with an arbitrary constant both
on entry and exit. If you are the first user in history to be proof against human error,
you can remove the two lines with the ^ operation.

Park and Miller discuss two other multipliers a that can be used with the same
$m = 2^{31} - 1$. These are a = 48271 (with q = 44488 and r = 3399) and a = 69621
(with q = 30845 and r = 23902). These can be substituted in the routine ran0
if desired; they may be slightly superior to Lewis et al.’s longer-tested values. No
values other than these should be used.
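
The quoted values of q and r follow directly from the factorization (7.1.4), and Schrage's condition r < q can be confirmed for each multiplier. The helper below is a hypothetical sketch for doing so.

```c
/* Compute q = [m/a] and r = m mod a as in (7.1.4).
   Returns 1 if Schrage's condition r < q holds, else 0. */
int schrage_factor(long a, long m, long *q, long *r)
{
    *q = m / a;
    *r = m % a;
    return *r < *q;
}
```

Calling it with m = 2147483647 and a = 16807, 48271, or 69621 reproduces the (q, r) pairs given in the text, with r < q in every case.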

The routine ran0 is a Minimal Standard, satisfactory for the majority of
applications, but we do not recommend it as the final word on random number
generators. Our reason is precisely the simplicity of the Minimal Standard. It is
not hard to think of situations where successive random numbers might be used
in a way that accidentally conflicts with the generation algorithm. For example,
since successive numbers differ by a multiple of only $1.6 \times 10^4$ out of a modulus of
more than $2 \times 10^9$, very small random numbers will tend to be followed by smaller
than average values. One time in $10^6$, for example, there will be a value $< 10^{-6}$
returned (as there should be), but this will always be followed by a value less than
about 0.0168. One can easily think of applications involving rare events where this
property would lead to wrong results.
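
The bound of about 0.0168 is just the multiplier times the threshold: if $I_j/m < 10^{-6}$, then $a I_j < m$, no reduction modulo m takes place, and the next deviate is $a I_j/m < 16807 \times 10^{-6}$. The sketch below confirms this by brute force over every state below the threshold. The function name is hypothetical, and a 64-bit product is used for simplicity in place of Schrage's method.

```c
/* Largest deviate that can follow any state z with z/m < threshold,
   for the Minimal Standard constants a = 16807, m = 2^31 - 1. */
double worst_successor(double threshold)
{
    const long long a = 16807, m = 2147483647;
    double worst = 0.0;
    for (long long z = 1; (double)z / m < threshold; z++) {
        double next = (double)((a * z) % m) / m;
        if (next > worst) worst = next;
    }
    return worst;
}
```

For a threshold of 1e-6 the result falls just above 0.0168 and below the strict bound 0.016807.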

There are other, more subtle, serial correlations present in ran0. For example,
if successive points $(I_i, I_{i+1})$ are binned into a two-dimensional plane for i =
1, 2, ..., N, then the resulting distribution fails the $\chi^2$ test when N is greater than a
few $\times 10^7$, much less than the period $m - 2$. Since low-order serial correlations have
historically been such a bugaboo, and since there is a very simple way to remove
them, we think that it is prudent to do so.

The following routine, ran1, uses the Minimal Standard for its random value,
but it shuffles the output to remove low-order serial correlations. A random deviate
derived from the jth value in the sequence, $I_j$, is output not on the jth call, but rather
on a randomized later call, j + 32 on average. The shuffling algorithm is due to Bays
and Durham as described in Knuth [4], and is illustrated in Figure 7.1.1.

#define IA 16807
#define IM 2147483647
#define AM (1.0/IM)
#define IQ 127773
#define IR 2836
#define NTAB 32
#define NDIV (1+(IM-1)/NTAB)
#define EPS 1.2e-7
#define RNMX (1.0-EPS)

float ran1(long *idum)
/* "Minimal" random number generator of Park and Miller with Bays-Durham shuffle and added
   safeguards. Returns a uniform random deviate between 0.0 and 1.0 (exclusive of the endpoint
   values). Call with idum a negative integer to initialize; thereafter, do not alter idum between
   successive deviates in a sequence. RNMX should approximate the largest floating value that is
   less than 1. */
{
    int j;
    long k;
    static long iy=0;
    static long iv[NTAB];
    float temp;

    if (*idum <= 0 || !iy) {              /* Initialize. */
        if (-(*idum) < 1) *idum=1;        /* Be sure to prevent idum = 0. */
        else *idum = -(*idum);
        for (j=NTAB+7;j>=0;j--) {         /* Load the shuffle table (after 8 warm-ups). */
            k=(*idum)/IQ;
            *idum=IA*(*idum-k*IQ)-IR*k;
            if (*idum < 0) *idum += IM;
            if (j < NTAB) iv[j] = *idum;
        }
        iy=iv[0];
    }
    k=(*idum)/IQ;                         /* Start here when not initializing. */
    *idum=IA*(*idum-k*IQ)-IR*k;           /* Compute idum=(IA*idum) % IM without over- */
    if (*idum < 0) *idum += IM;           /* flows by Schrage's method. */
    j=iy/NDIV;                            /* Will be in the range 0..NTAB-1. */
    iy=iv[j];                             /* Output previously stored value and refill the */
    iv[j] = *idum;                        /* shuffle table. */
    if ((temp=AM*iy) > RNMX) return RNMX; /* Because users don't expect endpoint values. */
    else return temp;
}



The routine ran1 passes those statistical tests that ran0 is known to fail. In
fact, we do not know of any statistical test that ran1 fails to pass, except when the
number of calls starts to become on the order of the period $m$, say $> 10^8 \approx m/20$.
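
Schrage's method, used in the listing above, can be checked against a straightforward 64-bit computation. The sketch below is an illustration only: the function names are hypothetical, and it assumes that the fixed-width types of <stdint.h> are available. Park and Miller [1] give a check value for the Minimal Standard: starting from a seed of 1, the state after 10000 steps should be 1043618065.

```c
#include <stdint.h>

#define MS_IA 16807
#define MS_IM 2147483647
#define MS_IQ 127773   /* MS_IM / MS_IA */
#define MS_IR 2836     /* MS_IM % MS_IA */

/* Schrage's method: (MS_IA*state) % MS_IM using only quantities that
   fit in a signed 32-bit integer. Requires 0 < state < MS_IM. */
static int32_t schrage_step(int32_t state)
{
    int32_t k = state / MS_IQ;

    state = MS_IA * (state - k * MS_IQ) - MS_IR * k;
    if (state < 0) state += MS_IM;
    return state;
}

/* Reference version with a 64-bit intermediate product. */
static int32_t wide_step(int32_t state)
{
    return (int32_t)(((int64_t)MS_IA * state) % MS_IM);
}

/* Advance the generator n steps from seed and return the final state. */
static int32_t schrage_iterate(int32_t seed, long n)
{
    long i;

    for (i = 0; i < n; i++) seed = schrage_step(seed);
    return seed;
}

/* Return 1 if both versions agree on each of n successive steps. */
static int schrage_matches_wide(int32_t seed, long n)
{
    long i;

    for (i = 0; i < n; i++) {
        if (schrage_step(seed) != wide_step(seed)) return 0;
        seed = schrage_step(seed);
    }
    return 1;
}
```

The point of the decomposition is that neither MS_IA*(state % MS_IQ) nor MS_IR*(state / MS_IQ) can exceed the modulus, so no intermediate value overflows 32 bits.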

For situations when even longer random sequences are needed, L'Ecuyer [6] has
given a good way of combining two different sequences with different periods so
as to obtain a new sequence whose period is the least common multiple of the two
periods. The basic idea is simply to add the two sequences, modulo the modulus of
either of them (call it $m$). A trick to avoid an intermediate value that overflows the
integer wordsize is to subtract rather than add, and then add back the constant $m - 1$
if the result is $\le 0$, so as to wrap around into the desired interval $0, \ldots, m - 1$.

Figure 7.1.1. Shuffling procedure used in ran1 to break up sequential correlations in the Minimal
Standard generator. Circled numbers indicate the sequence of events: On each call, the random number
in iy is used to choose a random element in the array iv. That element becomes the output random
number, and also is the next iy. Its spot in iv is refilled from the Minimal Standard routine.

Notice that it is not necessary that this wrapped subtraction be able to reach
all values $0, \ldots, m - 1$ from every value of the first sequence. Consider the absurd
extreme case where the value subtracted was only between 1 and 10: The resulting
sequence would still be no less random than the first sequence by itself. As a
practical matter it is only necessary that the second sequence have a range covering
substantially all of the range of the first. L'Ecuyer recommends the use of the two
generators $m_1 = 2147483563$ (with $a_1 = 40014$, $q_1 = 53668$, $r_1 = 12211$) and
$m_2 = 2147483399$ (with $a_2 = 40692$, $q_2 = 52774$, $r_2 = 3791$). Both moduli
are slightly less than $2^{31}$. The periods $m_1 - 1 = 2 \times 3 \times 7 \times 631 \times 81031$ and
$m_2 - 1 = 2 \times 19 \times 31 \times 1019 \times 1789$ share only the factor 2, so the period of
the combined generator is $\approx 2.3 \times 10^{18}$. For present computers, period exhaustion
is a practical impossibility.
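
The combination step by itself, without any shuffle, might be sketched as follows. The constants are those quoted above; the type combined_state and the function names are hypothetical, introduced only for this illustration.

```c
#include <stdint.h>

/* L'Ecuyer's two generators, constants as quoted in the text. */
#define LE_M1 2147483563
#define LE_A1 40014
#define LE_Q1 53668
#define LE_R1 12211
#define LE_M2 2147483399
#define LE_A2 40692
#define LE_Q2 52774
#define LE_R2 3791

typedef struct { int32_t s1, s2; } combined_state;   /* hypothetical name */

/* One multiplicative congruential step, (a*s) % m, by Schrage's method. */
static int32_t lcg_step(int32_t s, int32_t a, int32_t q, int32_t r, int32_t m)
{
    int32_t k = s / q;

    s = a * (s - k * q) - k * r;
    if (s < 0) s += m;
    return s;
}

/* Advance both generators and combine them by wrapped subtraction.
   The result lies in 1..LE_M1-1. */
static int32_t combined_next(combined_state *st)
{
    int32_t z;

    st->s1 = lcg_step(st->s1, LE_A1, LE_Q1, LE_R1, LE_M1);
    st->s2 = lcg_step(st->s2, LE_A2, LE_Q2, LE_R2, LE_M2);
    z = st->s1 - st->s2;              /* subtract rather than add: no overflow */
    if (z < 1) z += LE_M1 - 1;        /* wrap back into the desired interval */
    return z;
}
```

Both states must be seeded with values in the range 1 to m - 1 of their respective moduli. The subtraction cannot overflow a 32-bit integer because both operands are positive and less than $2^{31}$.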

Combining the two generators breaks up serial correlations to a considerable 
extent. We nevertheless recommend the additional shuffle that is implemented in 
the following routine, ran2. We think that, within the limits of its floating-point 
precision, ran2 provides perfect random numbers; a practical definition of “perfect” 
is that we will pay $1000 to the first reader who convinces us otherwise (by finding a 
statistical test that ran2 fails in a nontrivial way, excluding the ordinary limitations 
of a machine’s floating-point representation). 
#define IM1 2147483563
#define IM2 2147483399
#define AM (1.0/IM1)
#define IMM1 (IM1-1)
#define IA1 40014
#define IA2 40692
#define IQ1 53668
#define IQ2 52774
#define IR1 12211
#define IR2 3791
#define NTAB 32
#define NDIV (1+IMM1/NTAB)
#define EPS 1.2e-7
#define RNMX (1.0-EPS)

float ran2(long *idum)
/* Long period (> 2 x 10^18) random number generator of L'Ecuyer with Bays-Durham shuffle
   and added safeguards. Returns a uniform random deviate between 0.0 and 1.0 (exclusive of
   the endpoint values). Call with idum a negative integer to initialize; thereafter, do not alter
   idum between successive deviates in a sequence. RNMX should approximate the largest floating
   value that is less than 1. */
{
    int j;
    long k;
    static long idum2=123456789;
    static long iy=0;
    static long iv[NTAB];
    float temp;

    if (*idum <= 0) {                     /* Initialize. */
        if (-(*idum) < 1) *idum=1;        /* Be sure to prevent idum = 0. */
        else *idum = -(*idum);
        idum2=(*idum);
        for (j=NTAB+7;j>=0;j--) {         /* Load the shuffle table (after 8 warm-ups). */
            k=(*idum)/IQ1;
            *idum=IA1*(*idum-k*IQ1)-k*IR1;
            if (*idum < 0) *idum += IM1;
            if (j < NTAB) iv[j] = *idum;
        }
        iy=iv[0];
    }
    k=(*idum)/IQ1;                        /* Start here when not initializing. */
    *idum=IA1*(*idum-k*IQ1)-k*IR1;        /* Compute idum=(IA1*idum) % IM1 without */
    if (*idum < 0) *idum += IM1;          /* overflows by Schrage's method. */
    k=idum2/IQ2;
    idum2=IA2*(idum2-k*IQ2)-k*IR2;        /* Compute idum2=(IA2*idum2) % IM2 likewise. */
    if (idum2 < 0) idum2 += IM2;
    j=iy/NDIV;                            /* Will be in the range 0..NTAB-1. */
    iy=iv[j]-idum2;                       /* Here idum is shuffled, idum and idum2 are */
    iv[j] = *idum;                        /* combined to generate output. */
    if (iy < 1) iy += IMM1;
    if ((temp=AM*iy) > RNMX) return RNMX; /* Because users don't expect endpoint values. */
    else return temp;
}


L’Ecuyer [6] lists additional short generators that can be combined into longer 
ones, including generators that can be implemented in 16-bit integer arithmetic. 

Finally, we give you Knuth's suggestion [4] for a portable routine, which we
have translated to the present conventions as ran3. This is not based on the linear
congruential method at all, but rather on a subtractive method (see also [5]). One
might hope that its weaknesses, if any, are therefore of a highly different character
from the weaknesses, if any, of ran1 above. If you ever suspect trouble with one
routine, it is a good idea to try the other in the same application. ran3 has one
nice feature: if your machine is poor on integer arithmetic (i.e., is limited to 16-bit
integers), you can declare mj, mk, and ma[] as float, define mbig and mseed
as 4000000 and 1618033, respectively, and the routine will be rendered entirely
floating-point.


#include <stdlib.h>                       /* Change to math.h in K&R C. */
#define MBIG 1000000000
#define MSEED 161803398
#define MZ 0
#define FAC (1.0/MBIG)
/* According to Knuth, any large MBIG, and any smaller (but still large) MSEED can be substituted
   for the above values. */

float ran3(long *idum)
/* Returns a uniform random deviate between 0.0 and 1.0. Set idum to any negative value to
   initialize or reinitialize the sequence. */
{
    static int inext,inextp;
    static long ma[56];                   /* The value 56 (range ma[1..55]) is special and */
    static int iff=0;                     /* should not be modified; see Knuth. */
    long mj,mk;
    int i,ii,k;

    if (*idum < 0 || iff == 0) {          /* Initialization. */
        iff=1;
        mj=labs(MSEED-labs(*idum));       /* Initialize ma[55] using the seed idum and the */
        mj %= MBIG;                       /* large number MSEED. */
        ma[55]=mj;
        mk=1;
        for (i=1;i<=54;i++) {             /* Now initialize the rest of the table, */
            ii=(21*i) % 55;               /* in a slightly random order, */
            ma[ii]=mk;                    /* with numbers that are not especially random. */
            mk=mj-mk;
            if (mk < MZ) mk += MBIG;
            mj=ma[ii];
        }
        for (k=1;k<=4;k++)                /* We randomize them by "warming up the gener- */
            for (i=1;i<=55;i++) {         /* ator." */
                ma[i] -= ma[1+(i+30) % 55];
                if (ma[i] < MZ) ma[i] += MBIG;
            }
        inext=0;                          /* Prepare indices for our first generated number. */
        inextp=31;                        /* The constant 31 is special; see Knuth. */
        *idum=1;
    }
    /* Here is where we start, except on initialization. */
    if (++inext == 56) inext=1;           /* Increment inext and inextp, wrapping around */
    if (++inextp == 56) inextp=1;         /* 56 to 1. */
    mj=ma[inext]-ma[inextp];              /* Generate a new random number subtractively. */
    if (mj < MZ) mj += MBIG;              /* Be sure that it is in range. */
    ma[inext]=mj;                         /* Store it, */
    return mj*FAC;                        /* and output the derived uniform deviate. */
}



Quick and Dirty Generators 

One sometimes would like a “quick and dirty” generator to embed in a program, perhaps
taking only one or two lines of code, just to somewhat randomize things. One might wish to
process data from an experiment not always in exactly the same order, for example, so that
the first output is more “typical” than might otherwise be the case.

For this kind of application, all we really need is a list of “good” choices for m, a, and
c in equation (7.1.1). If we don't need a period longer than $10^4$ to $10^6$, say, we can keep the
value of $(m - 1)a + c$ small enough to avoid overflows that would otherwise mandate the
extra complexity of Schrage's method (above). We can thus easily embed in our programs

unsigned long jran,ia,ic,im;
float ran;

jran=(jran*ia+ic) % im;
ran=(float) jran / (float) im;

whenever we want a quick and dirty uniform deviate, or

jran=(jran*ia+ic) % im;
j=jlo+((jhi-jlo+1)*jran)/im;

whenever we want an integer between jlo and jhi, inclusive. (In both cases jran was once
initialized to any seed value between 0 and im-1.)

Be sure to remember, however, that when im is small, the kth root of it, which is the
number of planes in k-space, is even smaller! So a quick and dirty generator should never
be used to select points in k-space with k > 1.

With these caveats, some “good” choices for the constants are given in the accompanying
table. These constants (i) give a period of maximal length im, and, more important, (ii) pass
Knuth's “spectral test” for dimensions 2, 3, 4, 5, and 6. The increment ic is a prime, close to
the value $(\frac{1}{2} - \frac{1}{6}\sqrt{3})\,$im; actually almost any value of ic that is relatively prime to im will do
just as well, but there is some “lore” favoring this choice (see [4], p. 84).
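
As an illustration of how one row of the table might be used, the sketch below hard-codes the constants im = 6075, ia = 106, ic = 1283 and counts the period by brute force. The function names are ours, chosen for this example only.

```c
/* One row of the table of constants: im = 6075, ia = 106, ic = 1283. */
#define QD_IM 6075UL
#define QD_IA 106UL
#define QD_IC 1283UL

/* One step of the quick and dirty generator. Since (im-1)*ia + ic is
   below 2^20, the arithmetic cannot overflow an unsigned long. */
static unsigned long qd_next(unsigned long jran)
{
    return (jran * QD_IA + QD_IC) % QD_IM;
}

/* Map a state to an integer between jlo and jhi, inclusive. */
static long qd_int(unsigned long jran, long jlo, long jhi)
{
    return jlo + (long)(((unsigned long)(jhi - jlo + 1) * jran) / QD_IM);
}

/* Count the steps needed to return to the starting state. */
static unsigned long qd_period(unsigned long seed)
{
    unsigned long start = seed % QD_IM;
    unsigned long x = start;
    unsigned long count = 0;

    do {
        x = qd_next(x);
        count++;
    } while (x != start);
    return count;
}
```

For these constants qd_period returns im itself for any seed, which is what "a period of maximal length" means: every state from 0 to im-1 is visited exactly once per cycle.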


An Even Quicker Generator 

In C, if you multiply two unsigned long int integers on a machine with a 32-bit long
integer representation, the value returned is the low-order 32 bits of the true 64-bit product. If
we now choose $m = 2^{32}$, the “mod” in equation (7.1.1) is free, and we have simply

$$I_{j+1} = a I_j + c \qquad (7.1.6)$$

Knuth suggests a = 1664525 as a suitable multiplier for this value of m. H.W. Lewis
has conducted extensive tests of this value of a with c = 1013904223, which is a prime close
to $(\sqrt{5} - 2)m$. The resulting in-line generator (we will call it ranqd1) is simply



unsigned long idum; 

idum = 1664525L*idum + 1013904223L; 

This is about as good as any 32-bit linear congruential generator, entirely adequate for many 
uses. And, with only a single multiply and add, it is very fast. 

To check whether your machine has the desired integer properties, see if you can
generate the following sequence of 32-bit values (given here in hex): 00000000, 3C6EF35F,
47502932, D1CCF6E9, AAF95334, 6252E503, 9F2EC686, 57FE6C2D, A3D95FA8,
81FDBEE7, 94F0AF1A, CBF633B1.
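
On machines where unsigned long is wider than 32 bits, the implicit "mod $2^{32}$" no longer comes for free. One way around this, sketched below under the assumption that <stdint.h> provides uint32_t, is to use an exactly 32-bit type. The function names are hypothetical.

```c
#include <stdint.h>

/* One step of the in-line generator. Arithmetic on uint32_t is
   performed modulo 2^32, so no explicit "mod" is needed. */
static uint32_t ranqd1_next(uint32_t idum)
{
    return 1664525u * idum + 1013904223u;
}

/* Store the first n values of the sequence, beginning with the seed
   itself, in out[0..n-1]. */
static void ranqd1_fill(uint32_t seed, uint32_t *out, int n)
{
    uint32_t x = seed;
    int i;

    for (i = 0; i < n; i++) {
        out[i] = x;
        x = ranqd1_next(x);
    }
}
```

Filling a twelve-element array from a seed of 0 and printing each entry with the format "%08X" gives a sequence that can be compared directly with the hex values listed above.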

If you need floating-point values instead of 32-bit integers, and want to avoid a divide by
floating-point $2^{32}$, a dirty trick is to mask in an exponent that makes the value lie between 1 and
2, then subtract 1.0. The resulting in-line generator (call it ranqd2) will look something like
Constants for Quick and Dirty Random Number Generators

overflow at        im      ia       ic
   2^20          6075     106     1283
   2^21          7875     211     1663
   2^22          7875     421     1663
   2^23          6075    1366     1283
   2^23          6655     936     1399
   2^23         11979     430     2531
   2^24         14406     967     3041
   2^24         29282     419     6173
   2^24         53125     171    11213
   2^25         12960    1741     2731
   2^25         14000    1541     2957
   2^25         21870    1291     4621
   2^25         31104     625     6571
   2^25        139968     205    29573
   2^26         29282    1255     6173
   2^26         81000     421    17117
   2^26        134456     281    28411
   2^27         86436    1093    18257
   2^27        121500    1021    25673
   2^27        259200     421    54773
   2^28        117128    1277    24749
   2^28        121500    2041    25673
   2^28        312500     741    66037
   2^29        145800    3661    30809
   2^29        175000    2661    36979
   2^29        233280    1861    49297
   2^29        244944    1597    51749
   2^30        139968    3877    29573
   2^30        214326    3613    45289
   2^30        714025    1366   150889
   2^31        134456    8121    28411
   2^31        259200    7141    54773
   2^32        233280    9301    49297
   2^32        714025    4096   150889



unsigned long idum,itemp;
float rand;
#ifdef vax
static unsigned long jflone = 0x00004080;
static unsigned long jflmsk = 0xffff007f;
#else
static unsigned long jflone = 0x3f800000;
static unsigned long jflmsk = 0x007fffff;
#endif

idum = 1664525L*idum + 1013904223L;
itemp = jflone | (jflmsk & idum);
rand = (*(float *)&itemp)-1.0;


The hex constants 3F800000 and 007FFFFF are the appropriate ones for computers using 
the IEEE representation for 32-bit floating-point numbers (e.g., IBM PCs and most UNIX 
workstations). For DEC VAXes, the correct hex constants are, respectively, 00004080 and 
FFFF007F. Notice that the IEEE mask results in the floating-point number being constructed 
out of the 23 low-order bits of the integer, which is not ideal. (Your authors have tried 
very hard to make almost all of the material in this book machine and compiler independent 
— indeed, even programming language independent. This subsection is a rare aberration. 
Forgive us. Once in a great while the temptation to be really dirty is just irresistible.) 
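
For readers on current compilers, the same trick can be written without the pointer cast. The sketch below is an assumption-laden variant, not the in-line code above: it assumes that float is a 32-bit IEEE value, uses uint32_t from <stdint.h>, and copies the bit pattern with memcpy. The function name is hypothetical.

```c
#include <stdint.h>
#include <string.h>

/* Advance the 32-bit state and return a deviate in [0,1) built from
   its 23 low-order bits. Assumes float is a 32-bit IEEE number. */
static float ranqd2_next(uint32_t *idum)
{
    uint32_t itemp;
    float f;

    *idum = 1664525u * (*idum) + 1013904223u;
    itemp = 0x3f800000u | (0x007fffffu & *idum);  /* exponent of 1.0, random mantissa */
    memcpy(&f, &itemp, sizeof f);                 /* reinterpret the bits as a float */
    return f - 1.0f;                              /* shift [1,2) down to [0,1) */
}
```

Copying through memcpy sidesteps the aliasing problems that a cast such as *(float *)&itemp can run into under optimizing compilers; the generated values are the same as long as the IEEE assumption holds.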



Relative Timings and Recommendations 


Timings are inevitably machine dependent. Nevertheless the following table
is indicative of the relative timings, for typical machines, of the various uniform
generators discussed in this section, plus ran4 from §7.5. Smaller values in the table
indicate faster generators. The generators ranqd1 and ranqd2 refer to the “quick
and dirty” generators immediately above.


Generator     Relative Execution Time
ran0          ≡ 1.0
ran1          ≈ 1.3
ran2          ≈ 2.0
ran3          ≈ 0.6
ranqd1        ≈ 0.10
ranqd2        ≈ 0.25
ran4          ≈ 4.0


On balance, we recommend ran1 for general use. It is portable, based on
Park and Miller's Minimal Standard generator with an additional shuffle, and has no
known (to us) flaws other than period exhaustion.

If you are generating more than 100,000,000 random numbers in a single
calculation (that is, more than about 5% of ran1's period), we recommend the use
of ran2, with its much longer period.

Knuth’s subtractive routine ran3 seems to be the timing winner among portable 
routines. Unfortunately the subtractive method is not so well studied, and not a 
standard. We like to keep ran3 in reserve for a “second opinion,” substituting it when 
we suspect another generator of introducing unwanted correlations into a calculation. 

The routine ran4 generates extremely good random deviates, and has some 
other nice properties, but it is slow. See §7.5 for discussion. 

Finally, the quick and dirty in-line generators ranqd1 and ranqd2 are very fast,
but they are somewhat machine dependent, and at best only as good as a 32-bit linear 
congruential generator ever is — in our view not good enough in many situations. 
We would use these only in very special cases, where speed is critical. 


CITED REFERENCES AND FURTHER READING: 

Park, S.K., and Miller, K.W. 1988, Communications of the ACM, vol. 31, pp. 1192-1201. [1] 
Schrage, L. 1979, ACM Transactions on Mathematical Software, vol. 5, pp. 132-138. [2] 

Bratley, P., Fox, B.L., and Schrage, E.L. 1983, A Guide to Simulation (New York: Springer- 
Verlag). [3] 

Knuth, D.E. 1981, Seminumerical Algorithms, 2nd ed., vol. 2 of The Art of Computer Programming 
(Reading, MA: Addison-Wesley), §§3.2—3.3. [4] 

Kahaner, D., Moler, C., and Nash, S. 1989, Numerical Methods and Software (Englewood Cliffs, 
NJ: Prentice Hall), Chapter 10. [5] 

L'Ecuyer, P. 1988, Communications of the ACM, vol. 31, pp. 742-774. [6]

Forsythe, G.E., Malcolm, M.A., and Moler, C.B. 1977, Computer Methods for Mathematical 
Computations (Englewood Cliffs, NJ: Prentice-Hall), Chapter 10. 

7.2 Transformation Method: Exponential and
Normal Deviates


In the previous section, we learned how to generate random deviates with 
a uniform probability distribution, so that the probability of generating a number 
between x and x + dx, denoted p(x)dx, is given by 


$$p(x)\,dx = \begin{cases} dx & 0 < x < 1 \\ 0 & \text{otherwise} \end{cases} \qquad (7.2.1)$$

The probability distribution p(x) is of course normalized, so that

$$\int_{-\infty}^{\infty} p(x)\,dx = 1 \qquad (7.2.2)$$

Now suppose that we generate a uniform deviate x and then take some prescribed
function of it, y(x). The probability distribution of y, denoted p(y)dy, is determined
by the fundamental transformation law of probabilities, which is simply

$$|p(y)\,dy| = |p(x)\,dx| \qquad (7.2.3)$$

or

$$p(y) = p(x) \left| \frac{dx}{dy} \right| \qquad (7.2.4)$$

Exponential Deviates 

As an example, suppose that $y(x) \equiv -\ln(x)$, and that p(x) is as given by
equation (7.2.1) for a uniform deviate. Then

$$p(y)\,dy = \left| \frac{dx}{dy} \right| dy = e^{-y}\,dy \qquad (7.2.5)$$

which is distributed exponentially. This exponential distribution occurs frequently
in real problems, usually as the distribution of waiting times between independent
Poisson-random events, for example the radioactive decay of nuclei. You can also
easily see (from 7.2.4) that the quantity $y/\lambda$ has the probability distribution $\lambda e^{-\lambda y}$.
So we have

#include <math.h>

float expdev(long *idum)
/* Returns an exponentially distributed, positive, random deviate of unit mean, using
   ran1(idum) as the source of uniform deviates. */
{
    float ran1(long *idum);
    float dum;

    do
        dum=ran1(idum);
    while (dum == 0.0);
    return -log(dum);
}

Figure 7.2.1. Transformation method for generating a random deviate y from a known probability
distribution p(y). The indefinite integral of p(y) must be known and invertible. A uniform deviate x is
chosen between 0 and 1. Its corresponding y on the definite-integral curve is the desired deviate.

Let's see what is involved in using the above transformation method to generate
some arbitrary desired distribution of y's, say one with p(y) = f(y) for some positive
function f whose integral is 1. (See Figure 7.2.1.) According to (7.2.4), we need
to solve the differential equation

$$\frac{dx}{dy} = f(y) \qquad (7.2.6)$$

But the solution of this is just x = F(y), where F(y) is the indefinite integral of
f(y). The desired transformation which takes a uniform deviate into one distributed
as f(y) is therefore

$$y(x) = F^{-1}(x) \qquad (7.2.7)$$

where $F^{-1}$ is the inverse function to F. Whether (7.2.7) is feasible to implement
depends on whether the inverse function of the integral of f(y) is itself feasible to
compute, either analytically or numerically. Sometimes it is, and sometimes it isn't.

Incidentally, (7.2.7) has an immediate geometric interpretation: Since F(y) is 
the area under the probability curve to the left of y, (7.2.7) is just the prescription: 
choose a uniform random x, then find the value y that has that fraction x of 
probability area to its left, and return the value y. 

Normal (Gaussian) Deviates 

Transformation methods generalize to more than one dimension. If $x_1, x_2,
\ldots$ are random deviates with a joint probability distribution $p(x_1, x_2, \ldots)\,
dx_1\,dx_2 \ldots$, and if $y_1, y_2, \ldots$ are each functions of all the x's (same number of
y's as x's), then the joint probability distribution of the y's is

$$p(y_1, y_2, \ldots)\,dy_1\,dy_2 \ldots = p(x_1, x_2, \ldots) \left| \frac{\partial(x_1, x_2, \ldots)}{\partial(y_1, y_2, \ldots)} \right| dy_1\,dy_2 \ldots \qquad (7.2.8)$$

where $|\partial(\;)/\partial(\;)|$ is the Jacobian determinant of the x's with respect to the y's
(or reciprocal of the Jacobian determinant of the y's with respect to the x's).

An important example of the use of (7.2.8) is the Box-Muller method for
generating random deviates with a normal (Gaussian) distribution,

$$p(y)\,dy = \frac{1}{\sqrt{2\pi}}\, e^{-y^2/2}\,dy \qquad (7.2.9)$$

Consider the transformation between two uniform deviates on (0,1), $x_1, x_2$, and
two quantities $y_1, y_2$,

$$y_1 = \sqrt{-2 \ln x_1}\, \cos 2\pi x_2$$
$$y_2 = \sqrt{-2 \ln x_1}\, \sin 2\pi x_2 \qquad (7.2.10)$$

Equivalently we can write

$$x_1 = \exp\left[ -\tfrac{1}{2} (y_1^2 + y_2^2) \right]$$
$$x_2 = \frac{1}{2\pi} \arctan \frac{y_2}{y_1} \qquad (7.2.11)$$

Now the Jacobian determinant can readily be calculated (try it!):

$$\frac{\partial(x_1, x_2)}{\partial(y_1, y_2)} = \begin{vmatrix} \dfrac{\partial x_1}{\partial y_1} & \dfrac{\partial x_1}{\partial y_2} \\[2ex] \dfrac{\partial x_2}{\partial y_1} & \dfrac{\partial x_2}{\partial y_2} \end{vmatrix} = -\left[ \frac{1}{\sqrt{2\pi}}\, e^{-y_1^2/2} \right] \left[ \frac{1}{\sqrt{2\pi}}\, e^{-y_2^2/2} \right] \qquad (7.2.12)$$

Since this is the product of a function of $y_2$ alone and a function of $y_1$ alone, we see
that each y is independently distributed according to the normal distribution (7.2.9).

One further trick is useful in applying (7.2.10). Suppose that, instead of picking
uniform deviates $x_1$ and $x_2$ in the unit square, we instead pick $v_1$ and $v_2$ as the
ordinate and abscissa of a random point inside the unit circle around the origin. Then
the sum of their squares, $R^2 \equiv v_1^2 + v_2^2$, is a uniform deviate, which can be used for $x_1$,
while the angle that $(v_1, v_2)$ defines with respect to the $v_1$ axis can serve as the random
angle $2\pi x_2$. What's the advantage? It's that the cosine and sine in (7.2.10) can now
be written as $v_1/\sqrt{R^2}$ and $v_2/\sqrt{R^2}$, obviating the trigonometric function calls!

We thus have 


#include <math.h>

float gasdev(long *idum)
/* Returns a normally distributed deviate with zero mean and unit variance, using ran1(idum)
   as the source of uniform deviates. */
{
    float ran1(long *idum);
    static int iset=0;
    static float gset;
    float fac,rsq,v1,v2;

    if (*idum < 0) iset=0;                /* Reinitialize. */
    if (iset == 0) {                      /* We don't have an extra deviate handy, so */
        do {
            v1=2.0*ran1(idum)-1.0;        /* pick two uniform numbers in the square ex- */
            v2=2.0*ran1(idum)-1.0;        /* tending from -1 to +1 in each direction, */
            rsq=v1*v1+v2*v2;              /* see if they are in the unit circle, */
        } while (rsq >= 1.0 || rsq == 0.0);   /* and if they are not, try again. */
        fac=sqrt(-2.0*log(rsq)/rsq);
        /* Now make the Box-Muller transformation to get two normal deviates. Return one and
           save the other for next time. */
        gset=v1*fac;
        iset=1;                           /* Set flag. */
        return v2*fac;
    } else {                              /* We have an extra deviate handy, */
        iset=0;                           /* so unset the flag, */
        return gset;                      /* and return it. */
    }
}


See Devroye [1] and Bratley [2] for many additional algorithms.

CITED REFERENCES AND FURTHER READING: 

Devroye, L. 1986, Non-Uniform Random Variate Generation (New York: Springer-Verlag), §9.1. 
[1] 

Bratley, P., Fox, B.L., and Schrage, E.L. 1983, A Guide to Simulation (New York: Springer-Verlag). [2]

Knuth, D.E. 1981, Seminumerical Algorithms, 2nd ed., vol. 2 of The Art of Computer Programming 
(Reading, MA: Addison-Wesley), pp. 116ff. 


7.3 Rejection Method: Gamma, Poisson, 
Binomial Deviates 

The rejection method is a powerful, general technique for generating random 
deviates whose distribution function p(x) dx (probability of a value occurring between 
x and x + dx) is known and computable. The rejection method does not require 
that the cumulative distribution function [indefinite integral of p(x )] be readily 
computable, much less the inverse of that function — which was required for the 
transformation method in the previous section. 

The rejection method is based on a simple geometrical argument: 

Draw a graph of the probability distribution p(x) that you wish to generate, so 
that the area under the curve in any range of x corresponds to the desired probability 
of generating an x in that range. If we had some way of choosing a random point in 
two dimensions, with uniform probability in the area under your curve, then the x 
value of that random point would have the desired distribution. 

Now, on the same graph, draw any other curve f(x) which has finite (not
infinite) area and lies everywhere above your original probability distribution. (This
is always possible, because your original curve encloses only unit area, by definition
of probability.) We will call this f(x) the comparison function. Imagine now
that you have some way of choosing a random point in two dimensions that is
uniform in the area under the comparison function. Whenever that point lies outside
the area under the original probability distribution, we will reject it and choose
another random point. Whenever it lies inside the area under the original probability
distribution, we will accept it. It should be obvious that the accepted points are
uniform in the accepted area, so that their x values have the desired distribution. It
should also be obvious that the fraction of points rejected just depends on the ratio
of the area of the comparison function to the area of the probability distribution
function, not on the details of shape of either function. For example, a comparison
function whose area is less than 2 will reject fewer than half the points, even if it
approximates the probability function very badly at some values of x, e.g., remains
finite in some region where p(x) is zero.

Figure 7.3.1. Rejection method for generating a random deviate x from a known probability distribution
p(x) that is everywhere less than some other function f(x). The transformation method is first used to
generate a random deviate x of the distribution f (compare Figure 7.2.1). A second uniform deviate is
used to decide whether to accept or reject that x. If it is rejected, a new deviate of f is found; and so on.
The ratio of accepted to rejected points is the ratio of the area under p to the area between p and f.

It remains only to suggest how to choose a uniform random point in two
dimensions under the comparison function f(x). A variant of the transformation
method (§7.2) does nicely: Be sure to have chosen a comparison function whose
indefinite integral is known analytically, and is also analytically invertible to give x
as a function of “area under the comparison function to the left of x.” Now pick a
uniform deviate between 0 and A, where A is the total area under f(x), and use it
to get a corresponding x. Then pick a uniform deviate between 0 and f(x) as the y
value for the two-dimensional point. You should be able to convince yourself that the
point (x, y) is uniformly distributed in the area under the comparison function f(x).

An equivalent procedure is to pick the second uniform deviate between zero 
and one, and accept or reject according to whether it is respectively less than or 
greater than the ratio p(x)/f(x). 

So, to summarize, the rejection method for some given p(x) requires that one
find, once and for all, some reasonably good comparison function f(x). Thereafter,
each deviate generated requires two uniform random deviates, one evaluation of f (to
get the coordinate y), and one evaluation of p (to decide whether to accept or reject
the point x, y). Figure 7.3.1 illustrates the procedure. Then, of course, this procedure
must be repeated, on the average, A times before the final deviate is obtained.
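
A deliberately simple instance of the procedure may help fix ideas. In the sketch below the target density is $p(x) = \frac{3}{4}(1 - x^2)$ on $-1 < x < 1$ and the comparison function is the constant $f(x) = \frac{3}{4}$ on the same interval, so that A = 3/2. Both choices are ours, made only for illustration, as are the names; uniform01 is a stand-in uniform source.

```c
#include <stdint.h>

/* Stand-in uniform source, strictly inside (0,1). */
static double uniform01(uint32_t *state)
{
    *state = 1664525u * (*state) + 1013904223u;
    return ((double)(*state >> 8) + 0.5) / 16777216.0;
}

/* One deviate of p(x) = (3/4)(1 - x*x) on (-1,1) by rejection against
   the constant comparison function f(x) = 3/4. *tried counts every
   candidate point, accepted or not. */
static double rejectdev(uint32_t *state, long *tried)
{
    double x, u;

    do {
        x = 2.0 * uniform01(state) - 1.0;   /* x distributed as f (uniform here) */
        u = uniform01(state);               /* second uniform deviate */
        (*tried)++;
    } while (u >= 1.0 - x * x);             /* accept when u < p(x)/f(x) */
    return x;
}

/* Draw n deviates; report the sample variance about zero and the
   average number of candidates needed per accepted deviate. */
static void rejection_demo(uint32_t seed, long n, double *variance, double *tries_per_dev)
{
    uint32_t state = seed;
    long tried = 0;
    double sumsq = 0.0;
    long i;

    for (i = 0; i < n; i++) {
        double x = rejectdev(&state, &tried);
        sumsq += x * x;
    }
    *variance = sumsq / (double)n;
    *tries_per_dev = (double)tried / (double)n;
}
```

The exact variance of this density is 1/5, and the average number of candidates per accepted deviate should be close to A = 1.5, in line with the summary above.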


Gamma Distribution 

The gamma distribution of integer order a > 0 is the waiting time to the ath 
event in a Poisson random process of unit mean. For example, when a = 1, it is just 
the exponential distribution of §7.2, the waiting time to the first event. 

A gamma deviate has probability $p_a(x)\,dx$ of occurring with a value between
x and x + dx, where

$$p_a(x)\,dx = \frac{x^{a-1} e^{-x}}{\Gamma(a)}\,dx \qquad x > 0 \qquad (7.3.1)$$

To generate deviates of (7.3.1) for small values of a, it is best to add up a
exponentially distributed waiting times, i.e., logarithms of uniform deviates. Since
the sum of logarithms is the logarithm of the product, one really has only to generate
the product of a uniform deviates, then take the log.

For larger values of a, the distribution (7.3.1) has a typically “bell-shaped”
form, with a peak at x = a and a half-width of about $\sqrt{a}$.

We will be interested in several probability distributions with this same qualitative
form. A useful comparison function in such cases is derived from the
Lorentzian distribution

$$p(y)\,dy = \frac{1}{\pi} \left( \frac{1}{1 + y^2} \right) dy \qquad (7.3.2)$$

whose inverse indefinite integral is just the tangent function. It follows that the
x-coordinate of an area-uniform random point under the comparison function

$$f(x) = \frac{c_0}{1 + (x - x_0)^2 / a_0^2} \qquad (7.3.3)$$

for any constants $a_0$, $c_0$, and $x_0$, can be generated by the prescription

$$x = a_0 \tan(\pi U) + x_0 \qquad (7.3.4)$$

where U is a uniform deviate between 0 and 1. Thus, for some specific “bell-shaped”
p(x) probability distribution, we need only find constants $a_0$, $c_0$, $x_0$, with the product
$a_0 c_0$ (which determines the area) as small as possible, such that (7.3.3) is everywhere
greater than p(x).

Ahrens has done this for the gamma distribution, yielding the following
algorithm (as described in Knuth [1]):

#include <math.h>

float gamdev(int ia, long *idum)
/* Returns a deviate distributed as a gamma distribution of integer order ia, i.e., a waiting time
   to the iath event in a Poisson process of unit mean, using ran1(idum) as the source of
   uniform deviates. */
{
    float ran1(long *idum);
    void nrerror(char error_text[]);
    int j;
    float am,e,s,v1,v2,x,y;

    if (ia < 1) nrerror("Error in routine gamdev");
    if (ia < 6) {                         /* Use direct method, adding waiting times. */
        x=1.0;
        for (j=1;j<=ia;j++) x *= ran1(idum);
        x = -log(x);
    } else {                              /* Use rejection method. */
        do {
            do {
                do {                      /* These four lines generate the tan- */
                    v1=ran1(idum);        /* gent of a random angle, i.e., they */
                    v2=2.0*ran1(idum)-1.0;    /* are equivalent to */
                } while (v1*v1+v2*v2 > 1.0);  /* y = tan(PI * ran1(idum)). */
                y=v2/v1;
                am=ia-1;
                s=sqrt(2.0*am+1.0);
                x=s*y+am;                 /* We decide whether to reject x: */
            } while (x <= 0.0);           /* Reject in region of zero probability. */
            e=(1.0+y*y)*exp(am*log(x/am)-s*y);    /* Ratio of prob. fn. to comparison fn. */
        } while (ran1(idum) > e);         /* Reject on basis of a second uniform deviate. */
    }
    return x;
}


Poisson Deviates 


The Poisson distribution is conceptually related to the gamma distribution. It
gives the probability of a certain integer number m of unit rate Poisson random
events occurring in a given interval of time x, while the gamma distribution was the
probability of waiting time between x and x + dx to the mth event. Note that m takes
on only integer values $\ge 0$, so that the Poisson distribution, viewed as a continuous
distribution function $p_x(m)\,dm$, is zero everywhere except where m is an integer
$\ge 0$. At such places, it is infinite, such that the integrated probability over a region
containing the integer is some finite number. The total probability at an integer j is

$$\text{Prob}(j) = \int_{j-\epsilon}^{j+\epsilon} p_x(m)\,dm = \frac{x^j e^{-x}}{j!} \qquad (7.3.5)$$

At first sight this might seem an unlikely candidate distribution for the rejection
method, since no continuous comparison function can be larger than the infinitely
tall, but infinitely narrow, Dirac delta functions in $p_x(m)$. However, there is a trick
that we can do: Spread the finite area in the spike at j uniformly into the interval
between j and j + 1. This defines a continuous distribution $q_x(m)\,dm$ given by

$$q_x(m)\,dm = \frac{x^{[m]} e^{-x}}{[m]!}\,dm \qquad (7.3.6)$$

where [m] represents the largest integer less than m. If we now use the rejection
method to generate a (noninteger) deviate from (7.3.6), and then take the integer
part of that deviate, it will be as if drawn from the desired distribution (7.3.5). (See
Figure 7.3.2.) This trick is general for any integer-valued probability distribution.

For x large enough, the distribution (7.3.6) is qualitatively bell-shaped (albeit 
with a bell made out of small, square steps), and we can use the same kind of 
Lorentzian comparison function as was already used above. For small x, we can 
generate independent exponential deviates (waiting times between events); when the 
sum of these first exceeds x, then the number of events that would have occurred in 
waiting time x becomes known and is one less than the number of terms in the sum. 

Figure 7.3.2. Rejection method as applied to an integer-valued distribution. The method is performed
on the step function shown as a dashed line, yielding a real-valued deviate. This deviate is rounded down
to the next lower integer, which is output.

These ideas produce the following routine:


#include <math.h>
#define PI 3.141592654

float poidev(float xm, long *idum)
/* Returns as a floating-point number an integer value that is a random deviate drawn from a
Poisson distribution of mean xm, using ran1(idum) as a source of uniform random deviates. */
{
    float gammln(float xx);
    float ran1(long *idum);
    static float sq,alxm,g,oldm=(-1.0);     /* oldm is a flag for whether xm has changed
                                               since last call. */
    float em,t,y;

    if (xm < 12.0) {                        /* Use direct method. */
        if (xm != oldm) {
            oldm=xm;
            g=exp(-xm);                     /* If xm is new, compute the exponential. */
        }
        em = -1;
        t=1.0;
        /* Instead of adding exponential deviates it is equivalent to multiply
           uniform deviates. We never actually have to take the log, merely
           compare to the pre-computed exponential. */
        do {
            ++em;
            t *= ran1(idum);
        } while (t > g);
    } else {                                /* Use rejection method. */
        /* If xm has changed since the last call, then precompute some functions
           that occur below. */
        if (xm != oldm) {
            oldm=xm;
            sq=sqrt(2.0*xm);
            alxm=log(xm);
            g=xm*alxm-gammln(xm+1.0);       /* gammln returns the natural log of the
                                               gamma function. */
        }
        do {
            do {
                y=tan(PI*ran1(idum));       /* y is a deviate from a Lorentzian
                                               comparison function. */
                em=sq*y+xm;                 /* em is y, shifted and scaled. */
            } while (em < 0.0);             /* Reject if in regime of zero probability. */
            em=floor(em);                   /* The trick for integer-valued distributions. */
            t=0.9*(1.0+y*y)*exp(em*alxm-gammln(em+1.0)-g);
            /* The ratio of the desired distribution to the comparison function; we
               accept or reject by comparing it to another uniform deviate. The
               factor 0.9 is chosen so that t never exceeds 1. */
        } while (ran1(idum) > t);
    }
    return em;
}


Binomial Deviates 


If an event occurs with probability q, and we make n trials, then the number of 
times m that it occurs has the binomial distribution, 



$$ \int_{j-\epsilon}^{j+\epsilon} p_{n,q}(m)\,dm = \binom{n}{j} q^j (1-q)^{n-j} \qquad (7.3.7) $$


The binomial distribution is integer valued, with m taking on possible values 
from 0 to n. It depends on two parameters, n and q, so is correspondingly a 
bit harder to implement than our previous examples. Nevertheless, the techniques 
already illustrated are sufficiently powerful to do the job: 


#include <math.h>
#define PI 3.141592654

float bnldev(float pp, int n, long *idum)
/* Returns as a floating-point number an integer value that is a random deviate drawn from
a binomial distribution of n trials each of probability pp, using ran1(idum) as a source of
uniform random deviates. */
{
    float gammln(float xx);
    float ran1(long *idum);
    int j;
    static int nold=(-1);
    float am,em,g,angle,p,bnl,sq,t,y;
    static float pold=(-1.0),pc,plog,pclog,en,oldg;

    p=(pp <= 0.5 ? pp : 1.0-pp);
    /* The binomial distribution is invariant under changing pp to 1-pp, if we also
       change the answer to n minus itself; we'll remember to do this below. */
    am=n*p;                         /* This is the mean of the deviate to be produced. */
    if (n < 25) {                   /* Use the direct method while n is not too large. */
        bnl=0.0;                    /* This can require up to 25 calls to ran1. */
        for (j=1;j<=n;j++)
            if (ran1(idum) < p) ++bnl;
    } else if (am < 1.0) {
        /* If fewer than one event is expected out of 25 or more trials, then the
           distribution is quite accurately Poisson. Use direct Poisson method. */
        g=exp(-am);
        t=1.0;
        for (j=0;j<=n;j++) {
            t *= ran1(idum);
            if (t < g) break;
        }
        bnl=(j <= n ? j : n);
    } else {                        /* Use the rejection method. */
        if (n != nold) {            /* If n has changed, then compute useful quantities. */
            en=n;
            oldg=gammln(en+1.0);
            nold=n;
        }
        if (p != pold) {            /* If p has changed, then compute useful quantities. */
            pc=1.0-p;
            plog=log(p);
            pclog=log(pc);
            pold=p;
        }
        sq=sqrt(2.0*am*pc);
        /* The following code should by now seem familiar: rejection method with a
           Lorentzian comparison function. */
        do {
            do {
                angle=PI*ran1(idum);
                y=tan(angle);
                em=sq*y+am;
            } while (em < 0.0 || em >= (en+1.0));   /* Reject. */
            em=floor(em);           /* Trick for integer-valued distribution. */
            t=1.2*sq*(1.0+y*y)*exp(oldg-gammln(em+1.0)
                -gammln(en-em+1.0)+em*plog+(en-em)*pclog);
        } while (ran1(idum) > t);   /* Reject. This happens about 1.5 times per
                                       deviate, on average. */
        bnl=em;
    }
    if (p != pp) bnl=n-bnl;         /* Remember to undo the symmetry transformation. */
    return bnl;
}


See Devroye [2] and Bratley [3] for many additional algorithms. 


CITED REFERENCES AND FURTHER READING: 

Knuth, D.E. 1981, Seminumerical Algorithms, 2nd ed., vol. 2 of The Art of Computer Programming
(Reading, MA: Addison-Wesley), pp. 120ff. [1]

Devroye, L. 1986, Non-Uniform Random Variate Generation (New York: Springer-Verlag), §X.4. [2]

Bratley, P., Fox, B.L., and Schrage, E.L. 1983, A Guide to Simulation (New York: Springer-Verlag). [3]


7.4 Generation of Random Bits 

The C language gives you useful access to some machine-level bitwise operations
such as << (left shift). This section will show you how to put such abilities to good use.

The problem is how to generate single random bits, with 0 and 1 equally 
probable. Of course you can just generate uniform random deviates between zero 
and one and use their high-order bit (i.e., test if they are greater than or less than 
0.5). However this takes a lot of arithmetic; there are special-purpose applications, 
such as real-time signal processing, where you want to generate bits very much 
faster than that. 

One method for generating random bits, with two variant implementations, is
based on “primitive polynomials modulo 2.” The theory of these polynomials is
beyond our scope (although §7.7 and §20.3 will give you small tastes of it). Here,
suffice it to say that there are special polynomials among those whose coefficients
are zero or one. An example is

$$ x^{18} + x^5 + x^2 + x^1 + x^0 \qquad (7.4.1) $$

which we can abbreviate by just writing the nonzero powers of x, e.g.,

(18, 5, 2, 1, 0)

Every primitive polynomial modulo 2 of order n (= 18 above) defines a recurrence
relation for obtaining a new random bit from the n preceding ones. The recurrence
relation is guaranteed to produce a sequence of maximal length, i.e., cycle through
all possible sequences of n bits (except all zeros) before it repeats. Therefore one
can seed the sequence with any initial bit pattern (except all zeros), and get $2^n - 1$
random bits before the sequence repeats.

Let the bits be numbered from 1 (most recently generated) through n (generated
n steps ago), and denoted $a_1, a_2, \ldots, a_n$. We want to give a formula for a new bit
$a_0$. After generating $a_0$ we will shift all the bits by one, so that the old $a_n$ is finally
lost, and the new $a_0$ becomes $a_1$. We then apply the formula again, and so on.

“Method I” is the easiest to implement in hardware, requiring only a single
shift register n bits long and a few XOR (“exclusive or” or bit addition mod 2)
gates, the operation denoted in C by “^”. For the primitive polynomial given above,
the recurrence formula is

$$ a_0 = a_{18} \wedge a_5 \wedge a_2 \wedge a_1 \qquad (7.4.2) $$

The terms that are ^’d together can be thought of as “taps” on the shift register,
^’d into the register’s input. More generally, there is precisely one term for each
nonzero coefficient in the primitive polynomial except the constant (zero bit) term.
So the first term will always be $a_n$ for a primitive polynomial of degree n, while the
last term might or might not be $a_1$, depending on whether the primitive polynomial
has a term in $x^1$.

While it is simple in hardware, Method I is somewhat cumbersome in C, because
the individual bits must be collected by a sequence of full-word masks:

int irbit1(unsigned long *iseed)
/* Returns as an integer a random bit, based on the 18 low-significance bits in iseed (which is
modified for the next call). */
{
    unsigned long newbit;                   /* The accumulated XOR's. */

    newbit = ((*iseed >> 17) & 1)           /* Get bit 18. */
        ^ ((*iseed >> 4) & 1)               /* XOR with bit 5. */
        ^ ((*iseed >> 1) & 1)               /* XOR with bit 2. */
        ^ (*iseed & 1);                     /* XOR with bit 1. */
    *iseed=(*iseed << 1) | newbit;          /* Leftshift the seed and put the result of
                                               the XOR's in its bit 1. */
    return (int) newbit;
}

Figure 7.4.1. Two related methods for obtaining random bits from a shift register and a primitive
polynomial modulo 2. (a) The contents of selected taps are combined by exclusive-or (addition modulo
2), and the result is shifted in from the right. This method is easiest to implement in hardware. (b)
Selected bits are modified by exclusive-or with the leftmost bit, which is then shifted in from the right.
This method is easiest to implement in software.

“Method II” is less suited to direct hardware implementation (though still 
possible), but is beautifully suited to C. It modifies more than one bit among the 
saved n bits as each new bit is generated (Figure 7.4.1). It generates the maximal 
length sequence, but not in the same order as Method I. The prescription for the 
primitive polynomial (7.4.1) is: 


$$ \begin{aligned} a_0 &= a_{18} \\ a_5 &= a_5 \wedge a_0 \\ a_2 &= a_2 \wedge a_0 \\ a_1 &= a_1 \wedge a_0 \end{aligned} \qquad (7.4.3) $$


In general there will be an exclusive-or for each nonzero term in the primitive 
polynomial except 0 and n. The nice feature about Method II is that all the 
exclusive-or’s can usually be done as a single full-word exclusive-or operation: 

#define IB1 1                   /* Powers of 2. */
#define IB2 2
#define IB5 16
#define IB18 131072
#define MASK (IB1+IB2+IB5)

int irbit2(unsigned long *iseed)
/* Returns as an integer a random bit, based on the 18 low-significance bits in iseed (which is
modified for the next call). */
{
    if (*iseed & IB18) {        /* Change all masked bits, shift, and put 1 into bit 1. */
        *iseed=((*iseed ^ MASK) << 1) | IB1;
        return 1;
    } else {                    /* Shift and put 0 into bit 1. */
        *iseed <<= 1;
        return 0;
    }
}

A word of caution: do not use sequential bits from these routines as the bits of a
large, supposedly random, integer, or as the bits in the mantissa of a supposedly
random floating-point number. They are not very random for that purpose; see
Knuth [1]. Examples of acceptable uses of these random bits are: (i) multiplying a
signal randomly by ±1 at a rapid “chip rate,” so as to spread its spectrum uniformly
(but recoverably) across some desired bandpass, or (ii) Monte Carlo exploration
of a binary tree, where decisions as to whether to branch left or right are to be
made randomly.

Now we do not want you to go through life thinking that there is something 
special about the primitive polynomial of degree 18 used in the above examples. 
(We chose 18 because $2^{18}$ is small enough for you to verify our claims directly by
numerical experiment.) The accompanying table [2] lists one primitive polynomial 
for each degree up to 100. (In fact there exist many such for each degree. For 
example, see §7.7 for a complete table up to degree 10.) 
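Because $2^{18}$ is small, the maximal-length claim can be checked by brute force. The sketch below restates both update rules as functions of an 18-bit state and counts how many steps it takes for the state to return to its starting value. The names step_method1, step_method2, walk_cycle, cycle_length, and ones_per_cycle are assumptions made for this example, not routines from the text. For a primitive polynomial of degree 18 one expects $2^{18} - 1 = 262143$ steps, with $2^{17} = 131072$ ones among the bits produced in one cycle.

```c
#include <assert.h>

#define NBITS 18
#define STATE_MASK ((1UL << NBITS) - 1UL)   /* keeps the low 18 bits */

/* Method I: XOR the taps 18, 5, 2, 1 and shift the result in at bit 1. */
static int step_method1(unsigned long *state)
{
    unsigned long newbit = ((*state >> 17) & 1UL) ^ ((*state >> 4) & 1UL)
                         ^ ((*state >> 1) & 1UL) ^ (*state & 1UL);
    *state = ((*state << 1) | newbit) & STATE_MASK;
    return (int) newbit;
}

/* Method II: if bit 18 is set, flip bits 5, 2, 1, shift, and set bit 1. */
static int step_method2(unsigned long *state)
{
    if (*state & (1UL << 17)) {
        *state = (((*state ^ 0x13UL) << 1) | 1UL) & STATE_MASK;
        return 1;
    }
    *state = (*state << 1) & STATE_MASK;
    return 0;
}

/* Steps the chosen method until the state repeats. Returns the number of
   steps taken and, if ones is non-null, stores the number of 1 bits seen. */
static long walk_cycle(int method, unsigned long seed, long *ones)
{
    unsigned long state = seed & STATE_MASK;
    const unsigned long start = state;
    long steps = 0, count = 0;
    assert(start != 0UL);               /* the all-zero state never changes */
    do {
        count += (method == 1) ? step_method1(&state) : step_method2(&state);
        ++steps;
    } while (state != start && steps <= (long) STATE_MASK);
    if (ones) *ones = count;
    return steps;
}

static long cycle_length(int method, unsigned long seed)
{
    return walk_cycle(method, seed, (long *) 0);
}

static long ones_per_cycle(int method, unsigned long seed)
{
    long ones;
    walk_cycle(method, seed, &ones);
    return ones;
}
```

The state is masked to 18 bits after every step so that it can be compared with the starting value; the routines in the text let the higher bits of iseed accumulate, which does not affect the bits produced.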


CITED REFERENCES AND FURTHER READING: 

Knuth, D.E. 1981, Seminumerical Algorithms, 2nd ed., vol. 2 of The Art of Computer Programming
(Reading, MA: Addison-Wesley), pp. 29ff. [1]

Horowitz, P., and Hill, W. 1989, The Art of Electronics, 2nd ed. (Cambridge: Cambridge University
Press), §§9.32-9.37.

Tausworthe, R.C. 1965, Mathematics of Computation, vol. 19, pp. 201-209.

Watson, E.J. 1962, Mathematics of Computation, vol. 16, pp. 368-369. [2]


7.5 Random Sequences Based on Data Encryption

In Numerical Recipes' first edition, we described how to use the Data Encryption Standard
(DES) [1-3] for the generation of random numbers. Unfortunately, when implemented in
software in a high-level language like C, DES is very slow, so excruciatingly slow, in fact, that 
our previous implementation can be viewed as more mischievous than useful. Here we give 
a much faster and simpler algorithm which, though it may not be secure in the cryptographic 
sense, generates about equally good random numbers. 

DES, like its progenitor cryptographic system LUCIFER, is a so-called “block product 
cipher” [4]. It acts on 64 bits of input by iteratively applying (16 times, in fact) a kind of highly
nonlinear bit-mixing function. Figure 7.5.1 shows the flow of information in DES during
this mixing. The function g, which takes 32 bits into 32 bits, is called the “cipher function.”
Meyer and Matyas [4] discuss the importance of the cipher function being nonlinear, as well 
as other design criteria. 

DES constructs its cipher function g from an intricate set of bit permutations and table 
lookups acting on short sequences of consecutive bits. Apparently, this function was chosen 
to be particularly strong cryptographically (or conceivably as some critics contend, to have 
an exquisitely subtle cryptographic flaw!). For our purposes, a different function g that can 
be rapidly computed in a high-level computer language is preferable. Such a function may 
weaken the algorithm cryptographically. Our purposes are not, however, cryptographic: We 
want to find the fastest g, and smallest number of iterations of the mixing procedure in Figure 
7.5.1, such that our output random sequence passes the standard tests that are customarily 
applied to random number generators. The resulting algorithm will not be DES, but rather a 
kind of “pseudo-DES,” better suited to the purpose at hand. 

Following the criterion, mentioned above, that g should be nonlinear, we must give the 
integer multiply operation a prominent place in g. Because 64-bit registers are not generally 
accessible in high-level languages, we must confine ourselves to multiplying 16-bit operands
into a 32-bit result. So, the general idea of g, almost forced, is to calculate the three
distinct 32-bit products of the high and low 16-bit input half-words, and then to combine 
these, and perhaps additional fixed constants, by fast operations (e.g., add or exclusive-or) 
into a single 32-bit result. 

There are only a limited number of ways of effecting this general scheme, allowing 
systematic exploration of the alternatives. Experimentation, and tests of the randomness of 
the output, lead to the sequence of operations shown in Figure 7.5.2. The few new elements 
in the figure need explanation: The values $C_1$ and $C_2$ are fixed constants, chosen randomly
with the constraint that they have exactly 16 1-bits and 16 0-bits; combining these constants 
via exclusive-or ensures that the overall g has no bias towards 0 or 1 bits. 
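The balance constraint on the constants is easy to check mechanically. The sketch below counts the 1-bits in each of the eight hex constants that appear in the psdes listing later in this section; count_one_bits, all_balanced, and the array name psdes_constants are names introduced here for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Counts the 1-bits in a 32-bit word by repeatedly clearing the lowest one. */
static int count_one_bits(uint32_t word)
{
    int count = 0;
    while (word != 0u) {
        word &= word - 1u;
        ++count;
    }
    return count;
}

/* Returns 1 if every constant in the list has exactly 16 one-bits
   (and therefore exactly 16 zero-bits), 0 otherwise. */
static int all_balanced(const uint32_t *constants, int n)
{
    int i;
    assert(constants != 0 && n > 0);
    for (i = 0; i < n; i++)
        if (count_one_bits(constants[i]) != 16) return 0;
    return 1;
}

/* The four c1 values followed by the four c2 values from psdes. */
static const uint32_t psdes_constants[8] = {
    0xbaa96887u, 0x1e17d32cu, 0x03bcdc3cu, 0x0f33d1b2u,
    0x4b0f3b58u, 0xe874f0c3u, 0x6955c5a6u, 0x55a7ca46u
};
```

A check of this kind is also a cheap guard against transcription errors when the constants are typed in by hand.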

The “reverse half-words” operation in Figure 7.5.2 turns out to be essential; otherwise, 
the very lowest and very highest bits are not properly mixed by the three multiplications. 
The nonobvious choices in g are therefore: where along the vertical “pipeline” to do the 
reverse; in what order to combine the three products and $C_2$; and with which operation (add
or exclusive-or) should each combining be done? We tested these choices exhaustively before 
settling on the algorithm shown in the figure. 

It remains to determine the smallest number of iterations $N_{it}$ that we can get away with.
The minimum meaningful $N_{it}$ is evidently two, since a single iteration simply moves one
32-bit word without altering it. One can use the constants $C_1$ and $C_2$ to help determine an
appropriate $N_{it}$: When $N_{it} = 2$ and $C_1 = C_2 = 0$ (an intentionally very poor choice), the
generator fails several tests of randomness by easily measurable, though not overwhelming,
amounts. When $N_{it} = 4$, on the other hand, or with $N_{it} = 2$ but with the constants
$C_1, C_2$ nonsparse, we have been unable to find any statistical deviation from randomness in
sequences of up to $10^9$ floating numbers $r_i$ derived from this scheme. The combined strength
of $N_{it} = 4$ and nonsparse $C_1, C_2$ should therefore give sequences that are random to tests
even far beyond those that we have actually tried. These are our recommended conservative
parameter values, notwithstanding the fact that $N_{it} = 2$ (which is, of course, twice as fast)
has no nonrandomness discernible (by us).

Implementation of these ideas is straightforward. The following routine is not quite 
strictly portable, since it assumes that unsigned long integers are 32-bits, as is the case 
on most machines. However, there is no reason to believe that longer integers would be in 
any way inferior (with suitable extensions of the constants $C_1, C_2$). C does not provide a
convenient, portable way to divide a long integer into half words, so we must use a combination
of masking (& 0xffff) with left- and right-shifts by 16 bits (<<16 and >>16). On some
machines the half-word extraction could be made faster by the use of C’s union construction, 
but this would generally not be portable between “big-endian” and “little-endian” machines. 
(Big- and little-endian refer to the order in which the bytes are stored in a word.) 
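The mask-and-shift idiom for half-words can be written out as a few small helpers. This is only a sketch: it uses the fixed-width type uint32_t from <stdint.h> (not available to the original routine) so that the word size is explicit, and the function names are invented here.

```c
#include <assert.h>
#include <stdint.h>

/* Low 16 bits of a 32-bit word. */
static uint32_t low_half(uint32_t word)
{
    return word & 0xffffu;
}

/* High 16 bits of a 32-bit word, moved down to the low position. */
static uint32_t high_half(uint32_t word)
{
    return word >> 16;
}

/* The "reverse half-words" operation: exchange the two 16-bit halves. */
static uint32_t swap_halves(uint32_t word)
{
    return (word >> 16) | ((word & 0xffffu) << 16);
}

/* Rebuild a 32-bit word from two 16-bit halves. */
static uint32_t join_halves(uint32_t high, uint32_t low)
{
    assert(high <= 0xffffu && low <= 0xffffu);
    return (high << 16) | low;
}
```

Because these helpers act on the numerical value of the word rather than on its bytes in memory, they give the same result on big-endian and little-endian machines, which is the portability advantage over a union.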

#define NITER 4

void psdes(unsigned long *lword, unsigned long *irword)
/* “Pseudo-DES” hashing of the 64-bit word (lword,irword). Both 32-bit arguments are
returned hashed on all bits. */
{
    unsigned long i,ia,ib,iswap,itmph=0,itmpl=0;
    static unsigned long c1[NITER]={
        0xbaa96887L, 0x1e17d32cL, 0x03bcdc3cL, 0x0f33d1b2L};
    static unsigned long c2[NITER]={
        0x4b0f3b58L, 0xe874f0c3L, 0x6955c5a6L, 0x55a7ca46L};

    for (i=0;i<NITER;i++) {
        /* Perform niter iterations of DES logic, using a simpler (non-cryptographic)
           nonlinear function instead of DES's. */
        ia=(iswap=(*irword)) ^ c1[i];   /* The bit-rich constants c1 and (below) c2
                                           guarantee lots of nonlinear mixing. */
        itmpl = ia & 0xffff;
        itmph = ia >> 16;
        ib=itmpl*itmpl+ ~(itmph*itmph);
        *irword=(*lword) ^ (((ia = (ib >> 16) |
            ((ib & 0xffff) << 16)) ^ c2[i])+itmpl*itmph);
        *lword=iswap;
    }
}


The routine ran4, listed below, uses psdes to generate uniform random deviates. We 
adopt the convention that a negative value of the argument idum sets the left 32-bit word, while 
a positive value i sets the right 32-bit word, returns the ith random deviate, and increments 
idum to i + 1. This is no more than a convenient way of defining many different sequences 
(negative values of idum), but still with random access to each sequence (positive values of 
idum). To get a floating-point number from the 32-bit integer, we like to use the
masking trick described at the end of §7.1, above. The hex constants 3F800000 and 007FFFFF
are the appropriate ones for computers using the IEEE representation for 32-bit floating-point
numbers (e.g., IBM PCs and most UNIX workstations). For DEC VAXes, the correct hex
constants are, respectively, 00004080 and FFFF007F. For greater portability, you can instead
construct a floating number by making the (signed) 32-bit integer nonnegative (typically, you
add exactly $2^{31}$ if it is negative) and then multiplying it by a floating constant (typically $2^{-31}$).
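Both conversions described in this paragraph can be sketched in a few lines. The example below assumes that float is a 32-bit IEEE 754 single, and it uses memcpy in place of a pointer cast to copy the bit pattern; the function names are hypothetical. Clearing the top bit of the unsigned word is the two's-complement equivalent of adding $2^{31}$ to a negative signed value.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Masking trick: keep the low 23 bits as the mantissa of a float whose
   exponent field encodes the range [1,2), then subtract 1. */
static float unit_float_masked(uint32_t word)
{
    uint32_t pattern = 0x3f800000u | (0x007fffffu & word);
    float value;
    assert(sizeof(float) == sizeof(uint32_t));   /* IEEE single assumed */
    memcpy(&value, &pattern, sizeof value);
    return value - 1.0f;
}

/* Portable alternative: drop the sign bit and scale by 2^-31. */
static double unit_double_scaled(uint32_t word)
{
    return (double) (word & 0x7fffffffu) / 2147483648.0;
}
```

For the word 509C0C23 (hex) the masked version returns about 0.219120. The two functions use different bits of the word, so they return different deviates for the same input; both lie in the range 0 to 1, excluding 1.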

An interesting, and sometimes useful, feature of the routine ran4, below, is that it allows 
random access to the nth random value in a sequence, without the necessity of first generating 
values 1, ..., n - 1. This property is shared by any random number generator based on hashing
(the technique of mapping data keys, which may be highly clustered in value, approximately 
uniformly into a storage address space) [5,6]. One might have a simulation problem in which 
some certain rare situation becomes recognizable by its consequences only considerably after 
it has occurred. One may wish to restart the simulation back at that occurrence, using identical 
random values but, say, varying some other control parameters. The relevant question might 
then be something like “what random numbers were used in cycle number 337098901?” It 
might already be cycle number 395100273 before the question comes up. Random generators 
based on recursion, rather than hashing, cannot easily answer such a question. 

float ran4(long *idum)
/* Returns a uniform random deviate in the range 0.0 to 1.0, generated by pseudo-DES (DES-like)
hashing of the 64-bit word (idums,idum), where idums was set by a previous call with
negative idum. Also increments idum. Routine can be used to generate a random sequence
by successive calls, leaving idum unaltered between calls; or it can randomly access the nth
deviate in a sequence by calling with idum = n. Different sequences are initialized by calls with
differing negative values of idum. */
{
    void psdes(unsigned long *lword, unsigned long *irword);
    unsigned long irword,itemp,lword;
    static long idums = 0;
    /* The hexadecimal constants jflone and jflmsk below are used to produce a floating
       number between 1. and 2. by bitwise masking. They are machine-dependent. See text. */
#if defined(vax) || defined(_vax_) || defined(__vax__) || defined(VAX)
    static unsigned long jflone = 0x00004080;
    static unsigned long jflmsk = 0xffff007f;
#else
    static unsigned long jflone = 0x3f800000;
    static unsigned long jflmsk = 0x007fffff;
#endif

    if (*idum < 0) {                    /* Reset idums and prepare to return the first
                                           deviate in its sequence. */
        idums = -(*idum);
        *idum=1;
    }
    irword=(*idum);
    lword=idums;
    psdes(&lword,&irword);              /* “Pseudo-DES” encode the words. */
    itemp=jflone | (jflmsk & irword);   /* Mask to a floating number between 1 and 2. */
    ++(*idum);
    return (*(float *)&itemp)-1.0;      /* Subtraction moves range to 0. to 1. */
}


The accompanying table gives data for verifying that ran4 and psdes work correctly 
on your machine. We do not advise the use of ran4 unless you are able to reproduce the 
hex values shown. Typically, ran4 is about 4 times slower than ran0 (§7.1), or about 3
times slower than ran1.


Values for Verifying the Implementation of psdes

          before psdes call     after psdes call (hex)      ran4(idum)
  idum    lword     irword      lword       irword          VAX         PC
   -1        1         1        604D1DCE    509C0C23        0.275898    0.219120
   99        1        99        D97F8571    A66CB41A        0.208204    0.849246
  -99       99         1        7822309D    64300984        0.034307    0.375290
   99       99        99        D7F376F0    59BA89EB        0.838676    0.457334

Successive calls to ran4 with arguments -1, 99, -99, and 99 should produce exactly the
lword and irword values shown. Masking conversion to a returned floating random value
is allowed to be machine dependent; values for VAX and PC are shown.
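The first row of the table can be checked with a fixed-width restatement of the hashing step. The sketch below is an illustration under stated assumptions, not the book's routine: psdes32 uses uint32_t so that the 32-bit word size does not depend on the platform's unsigned long, and check_first_table_row and nth_deviate are names invented here. nth_deviate computes the nth value of a sequence directly, which is the random-access property described in the text; it uses the low 23 bits of the hashed word, mirroring the IEEE masking constants.

```c
#include <assert.h>
#include <stdint.h>

#define NITER 4

/* Fixed-width restatement of the pseudo-DES hashing step. */
static void psdes32(uint32_t *lword, uint32_t *irword)
{
    static const uint32_t c1[NITER] = {
        0xbaa96887u, 0x1e17d32cu, 0x03bcdc3cu, 0x0f33d1b2u};
    static const uint32_t c2[NITER] = {
        0x4b0f3b58u, 0xe874f0c3u, 0x6955c5a6u, 0x55a7ca46u};
    uint32_t ia, ib, iswap, itmph, itmpl;
    int i;

    for (i = 0; i < NITER; i++) {
        iswap = *irword;
        ia = iswap ^ c1[i];
        itmpl = ia & 0xffffu;                       /* low half-word */
        itmph = ia >> 16;                           /* high half-word */
        ib = itmpl * itmpl + ~(itmph * itmph);      /* wraps modulo 2^32 */
        ia = (ib >> 16) | ((ib & 0xffffu) << 16);   /* reverse half-words */
        *irword = *lword ^ ((ia ^ c2[i]) + itmpl * itmph);
        *lword = iswap;
    }
}

/* Reproduces the first table row: (lword, irword) = (1, 1) before the call. */
static void check_first_table_row(void)
{
    uint32_t lword = 1u, irword = 1u;
    psdes32(&lword, &irword);
    assert(lword == 0x604D1DCEu);
    assert(irword == 0x509C0C23u);
}

/* nth deviate of the given sequence, computed without generating the
   earlier ones. */
static double nth_deviate(uint32_t sequence, uint32_t n)
{
    uint32_t lword = sequence, irword = n;
    psdes32(&lword, &irword);
    return (double) (irword & 0x007fffffu) / 8388608.0;
}
```

With this conversion, nth_deviate(1, 1) should agree with the PC column of the first row (about 0.219120). The remaining rows can be checked in the same way by changing the starting values of lword and irword.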


CITED REFERENCES AND FURTHER READING: 

Data Encryption Standard, 1977 January 15, Federal Information Processing Standards Publication,
number 46 (Washington: U.S. Department of Commerce, National Bureau of Standards). [1]

Guidelines for Implementing and Using the NBS Data Encryption Standard, 1981 April 1, Federal
Information Processing Standards Publication, number 74 (Washington: U.S. Department
of Commerce, National Bureau of Standards). [2]

Validating the Correctness of Hardware Implementations of the NBS Data Encryption Standard,
1980, NBS Special Publication 500-20 (Washington: U.S. Department of Commerce,
National Bureau of Standards). [3]

Meyer, C.H. and Matyas, S.M. 1982, Cryptography: A New Dimension in Computer Data Security
(New York: Wiley). [4]

Knuth, D.E. 1973, Sorting and Searching, vol. 3 of The Art of Computer Programming (Reading,
MA: Addison-Wesley), Chapter 6. [5]

Vitter, J.S., and Chen, W-C. 1987, Design and Analysis of Coalesced Hashing (New York: Oxford
University Press). [6]



7.6 Simple Monte Carlo Integration 




Inspirations for numerical methods can spring from unlikely sources. “Splines” 
first were flexible strips of wood used by draftsmen. “Simulated annealing” (we 
shall see in §10.9) is rooted in a thermodynamic analogy. And who does not feel at
least a faint echo of glamor in the name “Monte Carlo method”?

Suppose that we pick N random points, uniformly distributed in a multidimensional
volume V. Call them $x_1, \ldots, x_N$. Then the basic theorem of Monte Carlo
integration estimates the integral of a function f over the multidimensional volume,

$$ \int f\,dV \approx V\langle f\rangle \pm V\sqrt{\frac{\langle f^2\rangle - \langle f\rangle^2}{N}} \qquad (7.6.1) $$

Here the angle brackets denote taking the arithmetic mean over the N sample points,

$$ \langle f\rangle \equiv \frac{1}{N}\sum_{i=1}^{N} f(x_i) \qquad \langle f^2\rangle \equiv \frac{1}{N}\sum_{i=1}^{N} f^2(x_i) \qquad (7.6.2) $$

The “plus-or-minus” term in (7.6.1) is a one standard deviation error estimate for 
the integral, not a rigorous bound; further, there is no guarantee that the error is 
distributed as a Gaussian, so the error term should be taken only as a rough indication 
of probable error. 

Suppose that you want to integrate a function g over a region W that is not 
easy to sample randomly. For example, W might have a very complicated shape. 
No problem. Just find a region V that includes W and that can easily be sampled 
(Figure 7.6.1), and then define f to be equal to g for points in W and equal to zero
for points outside of W (but still inside the sampled V). You want to try to make
V enclose W as closely as possible, because the zero values of f will increase the
error estimate term of (7.6.1). And well they should: points chosen outside of W 
have no information content, so the effective value of N, the number of points, is 
reduced. The error estimate in (7.6.1) takes this into account. 

General purpose routines for Monte Carlo integration are quite complicated 
(see §7.8), but a worked example will show the underlying simplicity of the method. 
Suppose that we want to find the weight and the position of the center of mass of an 
object of complicated shape, namely the intersection of a torus with the edge of a 
large box. In particular let the object be defined by the three simultaneous conditions

$$ z^2 + \left(\sqrt{x^2 + y^2} - 3\right)^2 \le 1 \qquad (7.6.3) $$

(torus centered on the origin with major radius = 3, minor radius = 1)

$$ x \ge 1 \qquad y \ge -3 \qquad (7.6.4) $$

(two faces of the box, see Figure 7.6.2). Suppose for the moment that the object
has a constant density $\rho$.

We want to estimate the following integrals over the interior of the complicated
object:

$$ \int \rho\,dx\,dy\,dz \qquad \int x\rho\,dx\,dy\,dz \qquad \int y\rho\,dx\,dy\,dz \qquad \int z\rho\,dx\,dy\,dz \qquad (7.6.5) $$

The coordinates of the center of mass will be the ratio of the latter three integrals
(linear moments) to the first one (the weight).

In the following fragment, the region V, enclosing the piece-of-torus W, is the
rectangular box extending from 1 to 4 in x, -3 to 4 in y, and -1 to 1 in z.

#include "nrutil.h"

n=...                               /* Set to the number of sample points desired. */
den=...                             /* Set to the constant value of the density. */
sw=swx=swy=swz=0.0;                 /* Zero the various sums to be accumulated. */
varw=varx=vary=varz=0.0;
vol=3.0*7.0*2.0;                    /* Volume of the sampled region. */
for(j=1;j<=n;j++) {
    x=1.0+3.0*ran2(&idum);          /* Pick a point randomly in the sampled region. */
    y=(-3.0)+7.0*ran2(&idum);
    z=(-1.0)+2.0*ran2(&idum);
    if (z*z+SQR(sqrt(x*x+y*y)-3.0) < 1.0) {    /* Is it in the torus? */
        sw += den;                  /* If so, add to the various cumulants. */
        swx += x*den;
        swy += y*den;
        swz += z*den;
        varw += SQR(den);
        varx += SQR(x*den);
        vary += SQR(y*den);
        varz += SQR(z*den);
    }
}
w=vol*sw/n;                         /* The values of the integrals (7.6.5), */
x=vol*swx/n;
y=vol*swy/n;
z=vol*swz/n;
dw=vol*sqrt((varw/n-SQR(sw/n))/n);  /* and their corresponding error estimates. */
dx=vol*sqrt((varx/n-SQR(swx/n))/n);
dy=vol*sqrt((vary/n-SQR(swy/n))/n);
dz=vol*sqrt((varz/n-SQR(swz/n))/n);


A change of variable can often be extremely worthwhile in Monte Carlo 
integration. Suppose, for example, that we want to evaluate the same integrals, 
but for a piece-of-torus whose density is a strong function of z, in fact varying 
according to 


$$ \rho(x, y, z) = e^{5z} \qquad (7.6.6) $$

One way to do this is to put the statement

    den=exp(5.0*z);

inside the if (...) block, just before den is first used. This will work, but it is
a poor way to proceed. Since (7.6.6) falls so rapidly to zero as z decreases (down
to its lower limit -1), most sampled points contribute almost nothing to the sum
of the weight or moments. These points are effectively wasted, almost as badly as
those that fall outside of the region W. A change of variable, exactly as in the
transformation methods of §7.2, solves this problem. Let

$$ ds = e^{5z}\,dz \quad\text{so that}\quad s = \tfrac{1}{5}e^{5z}, \quad z = \tfrac{1}{5}\ln(5s) \qquad (7.6.7) $$



Then $\rho\,dz = ds$, and the limits $-1 < z < 1$ become $0.00135 < s < 29.682$. The
program fragment now looks like this:

n=...                               /* Set to the number of sample points desired. */
sw=swx=swy=swz=0.0;
varw=varx=vary=varz=0.0;
ss=0.2*(exp(5.0)-exp(-5.0));        /* Interval of s to be random sampled. */
vol=3.0*7.0*ss;                     /* Volume in x,y,s-space. */
for(j=1;j<=n;j++) {
    x=1.0+3.0*ran2(&idum);
    y=(-3.0)+7.0*ran2(&idum);
    s=0.00135+ss*ran2(&idum);       /* Pick a point in s. */
    z=0.2*log(5.0*s);               /* Equation (7.6.7). */
    if (z*z+SQR(sqrt(x*x+y*y)-3.0) < 1.0) {
        sw += 1.0;                  /* Density is 1, since absorbed into definition of s. */
        swx += x;
        swy += y;
        swz += z;
        varw += 1.0;
        varx += x*x;
        vary += y*y;
        varz += z*z;
    }
}
w=vol*sw/n;                         /* The values of the integrals (7.6.5), */
x=vol*swx/n;
y=vol*swy/n;
z=vol*swz/n;
dw=vol*sqrt((varw/n-SQR(sw/n))/n);  /* and their corresponding error estimates. */
dx=vol*sqrt((varx/n-SQR(swx/n))/n);
dy=vol*sqrt((vary/n-SQR(swy/n))/n);
dz=vol*sqrt((varz/n-SQR(swz/n))/n);


If you think for a minute, you will realize that equation (7.6.7) was useful only 
because the part of the integrand that we wanted to eliminate ($e^{5z}$) was both integrable
analytically, and had an integral that could be analytically inverted. (Compare §7.2.) 
In general these properties will not hold. Question: What then? Answer: Pull out 
of the integrand the “best” factor that can be integrated and inverted. The criterion 
for “best” is to try to reduce the remaining integrand to a function that is as close 
as possible to constant. 

The limiting case is instructive: If you manage to make the integrand f exactly
constant, and if the region V, of known volume, exactly encloses the desired region
W, then the average of f that you compute will be exactly its constant value, and the
error estimate in equation (7.6.1) will exactly vanish. You will, in fact, have done
the integral exactly, and the Monte Carlo numerical evaluations are superfluous. So,
backing off from the extreme limiting case, to the extent that you are able to make f
approximately constant by change of variable, and to the extent that you can sample a 
region only slightly larger than W, you will increase the accuracy of the Monte Carlo 
integral. This technique is generically called reduction of variance in the literature. 

The fundamental disadvantage of simple Monte Carlo integration is that its 
accuracy increases only as the square root of N, the number of sampled points. If 
your accuracy requirements are modest, or if your computer budget is large, then 
the technique is highly recommended as one of great generality. In the next two 
sections we will see that there are techniques available for “breaking the square root 
of N barrier” and achieving, at least in some cases, higher accuracy with fewer 
function evaluations.

CITED REFERENCES AND FURTHER READING:

Hammersley, J.M., and Handscomb, D.C. 1964, Monte Carlo Methods (London: Methuen).

Shreider, Yu. A. (ed.) 1966, The Monte Carlo Method (Oxford: Pergamon).

Sobol', I.M. 1974, The Monte Carlo Method (Chicago: University of Chicago Press).

Kalos, M.H., and Whitlock, P.A. 1986, Monte Carlo Methods (New York: Wiley).


7.7 Quasi- (that is, Sub-) Random Sequences 

We have just seen that choosing N points uniformly randomly in an n-dimensional
space leads to an error term in Monte Carlo integration that decreases
as $1/\sqrt{N}$. In essence, each new point sampled adds linearly to an accumulated sum
that will become the function average, and also linearly to an accumulated sum of
squares that will become the variance (equation 7.6.2). The estimated error comes
from the square root of this variance, hence the power $N^{-1/2}$.

Just because this square root convergence is familiar does not, however, mean 
that it is inevitable. A simple counterexample is to choose sample points that lie 
on a Cartesian grid, and to sample each grid point exactly once (in whatever order). 
The Monte Carlo method thus becomes a deterministic quadrature scheme, albeit
a simple one, whose fractional error decreases at least as fast as $N^{-1}$ (even faster
if the function goes to zero smoothly at the boundaries of the sampled region, or 
is periodic in the region). 

The trouble with a grid is that one has to decide in advance how fine it should 
be. One is then committed to completing all of its sample points. With a grid, it is 
not convenient to “sample until” some convergence or termination criterion is met. 
One might ask if there is not some intermediate scheme, some way to pick sample 
points “at random,” yet spread out in some self-avoiding way, avoiding the chance 
clustering that occurs with uniformly random points. 

A similar question arises for tasks other than Monte Carlo integration. We might 
want to search an n-dimensional space for a point where some (locally computable) 
condition holds. Of course, for the task to be computationally meaningful, there 
had better be continuity, so that the desired condition will hold in some finite
n-dimensional neighborhood. We may not know a priori how large that neighborhood
is, however. We want to “sample until” the desired point is found, moving smoothly 
to finer scales with increasing samples. Is there any way to do this that is better 
than uncorrelated, random samples? 

The answer to the above question is “yes.” Sequences of n-tuples that fill 
n-space more uniformly than uncorrelated random points are called quasi-random 
sequences. That term is somewhat of a misnomer, since there is nothing “random” 
about quasi-random sequences: They are cleverly crafted to be, in fact, sub-random. 
The sample points in a quasi-random sequence are, in a precise sense, “maximally 
avoiding” of each other. 

Figure 7.7.1. First 1024 points of a two-dimensional Sobol’ sequence (panel titles include “points 1 to 128” and “points 129 to 512”). The sequence is generated
number-theoretically, rather than randomly, so successive points at any stage “know” how to fill in the
gaps in the previously generated distribution.

A conceptually simple example is Halton’s sequence [1]. In one dimension, the
jth number $H_j$ in the sequence is obtained by the following steps: (i) Write j as a
number in base b, where b is some prime. (For example j = 17 in base b = 3 is 122.)
(ii) Reverse the digits and put a radix point (i.e., a decimal point base b) in front of
the sequence. (In the example, we get 0.221 base 3.) The result is $H_j$. To get a
sequence of n-tuples in n-space, you make each component a Halton sequence with
a different prime base b. Typically, the first n primes are used.

It is not hard to see how Halton’s sequence works: Every time the number of 
digits in j increases by one place, j’s digit-reversed fraction becomes a factor of 
b finer-meshed. Thus the process is one of filling in all the points on a sequence 
of finer and finer Cartesian grids — and in a kind of maximally spread-out order 
on each grid (since, e.g., the most rapidly changing digit in j controls the most 
significant digit of the fraction). 
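
The digit-reversal recipe is short enough to write out. The following is a minimal sketch, not one of the book's routines: the names halton and halton_point are our own, and the arrays are zero-offset rather than unit-offset.

```c
#include <assert.h>

/* Return the j-th Halton number (j >= 1) in the given prime base:
   write j in that base, reverse the digits, and read them as a
   fraction with the radix point in front. */
double halton(unsigned long j, unsigned base)
{
    double frac = 1.0 / base;   /* weight of the next digit after the radix point */
    double h = 0.0;

    assert(base >= 2);
    while (j > 0) {
        h += (double)(j % base) * frac;  /* least significant digit of j comes first */
        j /= base;
        frac /= base;
    }
    return h;
}

/* Fill point[0..n-1] with the j-th n-dimensional Halton point,
   using primes[0..n-1] as the per-component bases (hypothetical helper). */
void halton_point(unsigned long j, int n, const unsigned primes[], double point[])
{
    int k;
    for (k = 0; k < n; k++) point[k] = halton(j, primes[k]);
}
```

For the example in the text, halton(17, 3) evaluates the base-3 fraction 0.221, that is, 2/3 + 2/9 + 1/27 = 25/27.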

Other ways of generating quasi-random sequences have been suggested by 
Faure, Sobol’, Niederreiter, and others. Bratley and Fox [2] provide a good review
and references, and discuss a particularly efficient variant of the Sobol’ [3] sequence
suggested by Antonov and Saleev [4]. It is this Antonov-Saleev variant whose
implementation we now discuss. 

Degree   Primitive Polynomials Modulo 2*

  1      0 (i.e., $x + 1$)
  2      1 (i.e., $x^2 + x + 1$)
  3      1, 2 (i.e., $x^3 + x + 1$ and $x^3 + x^2 + 1$)
  4      1, 4 (i.e., $x^4 + x + 1$ and $x^4 + x^3 + 1$)
  5      2, 4, 7, 11, 13, 14
  6      1, 13, 16, 19, 22, 25
  7      1, 4, 7, 8, 14, 19, 21, 28, 31, 32, 37, 41, 42, 50, 55, 56, 59, 62
  8      14, 21, 22, 38, 47, 49, 50, 52, 56, 67, 70, 84, 97, 103, 115, 122
  9      8, 13, 16, 22, 25, 44, 47, 52, 55, 59, 62, 67, 74, 81, 82, 87, 91, 94, 103, 104, 109, 122,
         124, 137, 138, 143, 145, 152, 157, 167, 173, 176, 181, 182, 185, 191, 194, 199, 218, 220,
         227, 229, 230, 234, 236, 241, 244, 253
 10      4, 13, 19, 22, 50, 55, 64, 69, 98, 107, 115, 121, 127, 134, 140, 145, 152, 158, 161, 171,
         181, 194, 199, 203, 208, 227, 242, 251, 253, 265, 266, 274, 283, 289, 295, 301, 316,
         319, 324, 346, 352, 361, 367, 382, 395, 398, 400, 412, 419, 422, 426, 428, 433, 446,
         454, 457, 472, 493, 505, 508

*Expressed as a decimal integer representing the interior bits (that is, omitting the
high-order bit and the unit bit).


The Sobol’ sequence generates numbers between zero and one directly as binary fractions
of length w bits, from a set of w special binary fractions, $V_i$, $i = 1, 2, \ldots, w$, called direction
numbers. In Sobol’s original method, the jth number $X_j$ is generated by XORing (bitwise
exclusive or) together the set of $V_i$’s satisfying the criterion on i, “the ith bit of j is nonzero.”
As j increments, in other words, different ones of the $V_i$’s flash in and out of $X_j$ on different
time scales. $V_1$ alternates between being present and absent most quickly, while $V_k$ goes from
present to absent (or vice versa) only every $2^{k-1}$ steps.

Antonov and Saleev’s contribution was to show that instead of using the bits of the 
integer j to select direction numbers, one could just as well use the bits of the Gray code of 
j, G(j). (For a quick review of Gray codes, look at §20.2.) 

Now G(j) and G(j + 1) differ in exactly one bit position, namely in the position of the 
rightmost zero bit in the binary representation of j (adding a leading zero to j if necessary). A 
consequence is that the (j + 1)st Sobol’-Antonov-Saleev number can be obtained from the jth
by XORing it with a single $V_i$, namely with i the position of the rightmost zero bit in j. This
makes the calculation of the sequence very efficient, as we shall see. 
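
The one-bit property is easy to check by brute force. The sketch below is illustrative only; the function names are ours, and bit positions are counted from 1 at the units bit, as in the text.

```c
#include <assert.h>

/* Gray code of j: adjacent integers map to codes that differ in one bit. */
unsigned long gray(unsigned long j)
{
    return j ^ (j >> 1);
}

/* 1-based position of the rightmost zero bit of j (position 1 is the units bit). */
int rightmost_zero_bit(unsigned long j)
{
    int pos = 1;
    while (j & 1UL) {
        j >>= 1;
        pos++;
    }
    return pos;
}

/* Returns 1 if G(j) and G(j+1) differ only in the bit at the position
   of the rightmost zero bit of j, and 0 otherwise. */
int gray_step_ok(unsigned long j)
{
    unsigned long diff;
    assert(j < ~0UL);                       /* j must have at least one zero bit */
    diff = gray(j) ^ gray(j + 1);
    return diff == (1UL << (rightmost_zero_bit(j) - 1));
}
```

For instance j = 5 is 101 in binary, so its rightmost zero bit is in position 2, and G(5) = 111 and G(6) = 101 indeed differ only there.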

Figure 7.7.1 plots the first 1024 points generated by a two-dimensional Sobol’ sequence. 
One sees that successive points do “know” about the gaps left previously, and keep filling 
them in, hierarchically. 

We have deferred to this point a discussion of how the direction numbers $V_i$ are generated.
Some nontrivial mathematics is involved in that, so we will content ourselves with a cookbook
summary only: Each different Sobol’ sequence (or component of an n-dimensional sequence) 
is based on a different primitive polynomial over the integers modulo 2, that is, a polynomial 
whose coefficients are either 0 or 1, and which generates a maximal length shift register 
sequence. (Primitive polynomials modulo 2 were used in §7.4, and are further discussed in 
§20.3.) Suppose P is such a polynomial, of degree q, 

$$P = x^q + a_1 x^{q-1} + a_2 x^{q-2} + \cdots + a_{q-1} x + 1 \tag{7.7.1}$$

Initializing Values Used in sobseq

Degree   Polynomial   Starting Values

  1          0        1    (3)   (5)   (15)  ...
  2          1        1    1     (7)   (11)  ...
  3          1        1    3     7     (5)   ...
  3          2        1    3     3     (15)  ...
  4          1        1    1     3     13    ...
  4          4        1    1     5     9     ...

Parenthesized values are not freely specifiable, but are forced by the required recurrence
for this degree.


Define a sequence of integers $M_i$ by the q-term recurrence relation,

$$M_i = 2 a_1 M_{i-1} \oplus 2^2 a_2 M_{i-2} \oplus \cdots \oplus 2^{q-1} M_{i-q+1} a_{q-1} \oplus (2^q M_{i-q} \oplus M_{i-q}) \tag{7.7.2}$$

Here bitwise XOR is denoted by $\oplus$. The starting values for this recurrence are that $M_1, \ldots, M_q$
can be arbitrary odd integers less than $2, \ldots, 2^q$, respectively. Then, the direction numbers
$V_i$ are given by

$$V_i = M_i / 2^i \qquad i = 1, \ldots, w \tag{7.7.3}$$

The accompanying table lists all primitive polynomials modulo 2 with degree $q \le 10$.
Since the coefficients are either 0 or 1, and since the coefficients of $x^q$ and of 1 are predictably
1, it is convenient to denote a polynomial by its middle coefficients taken as the bits of a binary 
number (higher powers of x being more significant bits). The table uses this convention. 
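
To see recurrence (7.7.2) and the table convention working together, here is a small sketch that extends a set of starting values. The function name sobol_extend_m and its argument layout are our own invention for illustration; the routine sobseq below does the equivalent work on pre-shifted values.

```c
#include <assert.h>

/* Extend m[1..q] (the starting values) to m[1..count] using recurrence (7.7.2).
   `interior` encodes a_1..a_{q-1} in the table's convention: a_1, the
   coefficient of x^(q-1), is the most significant bit. m[0] is unused. */
void sobol_extend_m(unsigned long m[], int q, unsigned interior, int count)
{
    int i, k;

    assert(q >= 1 && count >= q);
    for (i = q + 1; i <= count; i++) {
        /* The last two terms of (7.7.2): 2^q M_{i-q} XOR M_{i-q}. */
        unsigned long v = (m[i - q] << q) ^ m[i - q];
        for (k = 1; k <= q - 1; k++) {
            unsigned a_k = (interior >> (q - 1 - k)) & 1u;
            if (a_k) v ^= m[i - k] << k;    /* the term 2^k a_k M_{i-k} */
        }
        m[i] = v;
    }
}
```

Starting from 1, 3, 7 with degree 3 and polynomial 1, the next value comes out as 5; starting from 1, 3, 3 with polynomial 2 it comes out as 15. Both agree with the parenthesized (forced) entries in the table of initializing values.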

Turn now to the implementation of the Sobol’ sequence. Successive calls to the function 
sobseq (after a preliminary initializing call) return successive points in an n-dimensional 
Sobol’ sequence based on the first n primitive polynomials in the table. As given, the routine 
is initialized for maximum n of 6 dimensions, and for a word length w of 30 bits. These 
parameters can be altered by changing MAXBIT (= w) and MAXDIM, and by adding more 
initializing data to the arrays ip (the primitive polynomials from the table), mdeg (their 
degrees), and iv (the starting values for the recurrence, equation 7.7.2). A second table, 
above, elucidates the initializing data in the routine. 

#include "nrutil.h"
#define MAXBIT 30
#define MAXDIM 6

void sobseq(int *n, float x[])
/* When n is negative, internally initializes a set of MAXBIT direction numbers for each of MAXDIM
different Sobol' sequences. When n is positive (but <= MAXDIM), returns as the vector x[1..n]
the next values from n of these sequences. (n must not be changed between initializations.) */
{
	int j,k,l;
	unsigned long i,im,ipp;
	static float fac;
	static unsigned long in,ix[MAXDIM+1],*iu[MAXBIT+1];
	static unsigned long mdeg[MAXDIM+1]={0,1,2,3,3,4,4};
	static unsigned long ip[MAXDIM+1]={0,0,1,1,2,1,4};
	static unsigned long iv[MAXDIM*MAXBIT+1]={
		0,1,1,1,1,1,1,3,1,3,3,1,1,5,7,7,3,3,5,15,11,5,15,13,9};

	if (*n < 0) {				/* Initialize, don't return a vector. */
		for (k=1;k<=MAXDIM;k++) ix[k]=0;
		in=0;
		if (iv[1] != 1) return;
		fac=1.0/(1L << MAXBIT);
		for (j=1,k=0;j<=MAXBIT;j++,k+=MAXDIM) iu[j] = &iv[k];
		/* To allow both 1D and 2D addressing. */
		for (k=1;k<=MAXDIM;k++) {
			for (j=1;j<=mdeg[k];j++) iu[j][k] <<= (MAXBIT-j);
			/* Stored values only require normalization. */
			for (j=mdeg[k]+1;j<=MAXBIT;j++) {	/* Use the recurrence to get other values. */
				ipp=ip[k];
				i=iu[j-mdeg[k]][k];
				i ^= (i >> mdeg[k]);
				for (l=mdeg[k]-1;l>=1;l--) {
					if (ipp & 1) i ^= iu[j-l][k];
					ipp >>= 1;
				}
				iu[j][k]=i;
			}
		}
	} else {				/* Calculate the next vector in the sequence. */
		im=in++;
		for (j=1;j<=MAXBIT;j++) {	/* Find the rightmost zero bit. */
			if (!(im & 1)) break;
			im >>= 1;
		}
		if (j > MAXBIT) nrerror("MAXBIT too small in sobseq");
		im=(j-1)*MAXDIM;
		for (k=1;k<=IMIN(*n,MAXDIM);k++) {
			/* XOR the appropriate direction number into each component
			   of the vector and convert to a floating number. */
			ix[k] ^= iv[im+k];
			x[k]=ix[k]*fac;
		}
	}
}



How good is a Sobol’ sequence, anyway? For Monte Carlo integration of a smooth 
function in n dimensions, the answer is that the fractional error will decrease with N, the
number of samples, as $(\ln N)^n/N$, i.e., almost as fast as $1/N$. As an example, let us integrate
a function that is nonzero inside a torus (doughnut) in three-dimensional space. If the major
radius of the torus is $R_0$, the minor radial coordinate r is defined by


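$$r = \left( \left[ (x^2 + y^2)^{1/2} - R_0 \right]^2 + z^2 \right)^{1/2} \tag{7.7.4}$$

Let us try the function

$$f(x, y, z) = \begin{cases} 1 + \cos(\pi r^2 / a^2) & r < r_0 \\ 0 & r \ge r_0 \end{cases} \tag{7.7.5}$$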


which can be integrated analytically in cylindrical coordinates, giving 



$$\int\!\!\int\!\!\int dx\,dy\,dz\, f(x, y, z) = 2\pi^2 a^2 R_0 \tag{7.7.6}$$


With parameters $R_0 = 0.6$, $r_0 = 0.3$, we did 100 successive Monte Carlo integrations of
equation (7.7.5), sampling uniformly in the region $-1 < x, y, z < 1$, for the two cases of
uncorrelated random points and the Sobol’ sequence generated by the routine sobseq. Figure
7.7.2 shows the results, plotting the r.m.s. average error of the 100 integrations as a function
of the number of points sampled. (For any single integration, the error of course wanders
from positive to negative, or vice versa, so a logarithmic plot of fractional error is not very
informative.) The thin, dashed curve corresponds to uncorrelated random points and shows
the familiar $N^{-1/2}$ asymptotics. The thin, solid gray curve shows the result for the Sobol’
sequence. The logarithmic term in the expected $(\ln N)^3/N$ is readily apparent as curvature
in the curve, but the asymptotic $N^{-1}$ is unmistakable.

Figure 7.7.2. Fractional accuracy of Monte Carlo integrations as a function of number of points sampled (horizontal axis: number of points N),
for two different integrands and two different methods of choosing random points. The quasi-random
Sobol’ sequence converges much more rapidly than a conventional pseudo-random sequence. Quasi-random
sampling does better when the integrand is smooth (“soft boundary”) than when it has step
discontinuities (“hard boundary”). The curves shown are the r.m.s. average of 100 trials.

To understand the importance of Figure 7.7.2, suppose that a Monte Carlo integration of
f with 1% accuracy is desired. The Sobol’ sequence achieves this accuracy in a few thousand
samples, while pseudorandom sampling requires nearly 100,000 samples. The ratio would 
be even greater for higher desired accuracies. 

A different, not quite so favorable, case occurs when the function being integrated has 
hard (discontinuous) boundaries inside the sampling region, for example the function that is 
one inside the torus, zero outside, 

$$f(x, y, z) = \begin{cases} 1 & r < r_0 \\ 0 & r \ge r_0 \end{cases} \tag{7.7.7}$$

where r is defined in equation (7.7.4). Not by coincidence, this function has the same analytic
integral as the function of equation (7.7.5), namely $2\pi^2 a^2 R_0$.

The carefully hierarchical Sobol’ sequence is based on a set of Cartesian grids, but the 
boundary of the torus has no particular relation to those grids. The result is that it is essentially 
random whether sampled points in a thin layer at the surface of the torus, containing on the 
order of $N^{2/3}$ points, come out to be inside, or outside, the torus. The square root law, applied
to this thin layer, gives $N^{1/3}$ fluctuations in the sum, or $N^{-2/3}$ fractional error in the Monte
Carlo integral. One sees this behavior verified in Figure 7.7.2 by the thicker gray curve. The 
thicker dashed curve in Figure 7.7.2 is the result of integrating the function of equation (7.7.7) 
using independent random points. While the advantage of the Sobol’ sequence is not quite so 
dramatic as in the case of a smooth function, it can nonetheless be a significant factor (~5) 
even at modest accuracies like 1%, and greater at higher accuracies. 
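
As a rough, self-contained check on the hard-boundary case, the sketch below estimates the integral of equation (7.7.7) with a three-dimensional Halton sequence (bases 2, 3, 5) standing in for sobseq. That substitution is ours and is not the experiment of Figure 7.7.2. The function names, the square-root-free form of the test r < r0, and the identification a = r0 in the analytic value are likewise assumptions of the sketch.

```c
#include <assert.h>

#define PI 3.14159265358979323846

/* j-th Halton number in the given base (digit-reversed fraction). */
static double halton(unsigned long j, unsigned base)
{
    double frac = 1.0 / base, h = 0.0;
    while (j > 0) {
        h += (double)(j % base) * frac;
        j /= base;
        frac /= base;
    }
    return h;
}

/* Hard-boundary integrand of equation (7.7.7): 1 inside the torus, 0 outside.
   With s = x^2 + y^2 + z^2 + R0^2 - r0^2 (positive when R0 > r0), the
   condition r < r0 is equivalent to s^2 < 4 R0^2 (x^2 + y^2). */
static double torus_indicator(double x, double y, double z, double R0, double r0)
{
    double rho2 = x * x + y * y;
    double s = rho2 + z * z + R0 * R0 - r0 * r0;
    return (s * s < 4.0 * R0 * R0 * rho2) ? 1.0 : 0.0;
}

/* Estimate the integral from n quasi-random points in the cube -1 < x,y,z < 1. */
double torus_volume_halton(unsigned long n, double R0, double r0)
{
    double sum = 0.0;
    unsigned long j;

    assert(n > 0 && R0 > r0 && r0 > 0.0);
    for (j = 1; j <= n; j++) {
        double x = 2.0 * halton(j, 2) - 1.0;
        double y = 2.0 * halton(j, 3) - 1.0;
        double z = 2.0 * halton(j, 5) - 1.0;
        sum += torus_indicator(x, y, z, R0, r0);
    }
    return 8.0 * sum / (double)n;    /* cube volume times the mean of f */
}

/* Analytic value 2 pi^2 r0^2 R0 (equation 7.7.6, assuming a = r0). */
double torus_volume_exact(double R0, double r0)
{
    return 2.0 * PI * PI * r0 * r0 * R0;
}
```

With R0 = 0.6 and r0 = 0.3 the torus lies inside the sampling cube, and the exact value is about 1.066.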
Note that we have not provided the routine sobseq with a means of starting the
sequence at a point other than the beginning, but this feature would be easy to add. Once
the initialization of the direction numbers iv has been done, the jth point can be obtained
directly by XORing together those direction numbers corresponding to nonzero bits in the 
Gray code of j, as described above. 
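
A one-dimensional sketch of that direct evaluation follows. It is not part of sobseq: the names are hypothetical, only the degree-1 polynomial x + 1 is used, and points are counted from j = 0 at the origin (our convention for the sketch).

```c
#include <assert.h>

#define W 30   /* word length in bits, playing the role of MAXBIT */

/* Fill v[1..W] with the direction numbers for the polynomial x + 1,
   stored as integers scaled by 2^W (so V_i = v[i] / 2^W). v[0] is unused. */
void make_direction_numbers(unsigned long v[])
{
    unsigned long m = 1;               /* M_1 = 1 */
    int i;
    for (i = 1; i <= W; i++) {
        v[i] = m << (W - i);           /* V_i = M_i / 2^i, scaled by 2^W */
        m = (m << 1) ^ m;              /* recurrence (7.7.2) for q = 1 */
    }
}

/* The j-th point computed directly: XOR together the V_i for each
   nonzero bit i of the Gray code of j. */
unsigned long sobol_direct(const unsigned long v[], unsigned long j)
{
    unsigned long g = j ^ (j >> 1);    /* Gray code of j */
    unsigned long x = 0;
    int i;

    assert(j < (1UL << W));
    for (i = 1; i <= W && g != 0; i++, g >>= 1)
        if (g & 1UL) x ^= v[i];
    return x;
}

/* One sequential step from point j to point j+1: XOR in the single V_i
   whose index is the position of the rightmost zero bit of j. */
unsigned long sobol_next(const unsigned long v[], unsigned long x, unsigned long j)
{
    int i = 1;
    while (j & 1UL) { j >>= 1; i++; }
    assert(i <= W);
    return x ^ v[i];
}
```

Stepping sequentially with sobol_next and jumping directly with sobol_direct give the same integers, so a restart at an arbitrary j costs at most W XOR operations per component.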

The Latin Hypercube 

We might here give passing mention to the unrelated technique of Latin square or
Latin hypercube sampling, which is useful when you must sample an N-dimensional
space exceedingly sparsely, at M points. For example, you may want to test the 
crashworthiness of cars as a simultaneous function of 4 different design parameters, 
but with a budget of only three expendable cars. (The issue is not whether this is a 
good plan — it isn’t — but rather how to make the best of the situation!) 

The idea is to partition each design parameter (dimension) into M segments, so 
that the whole space is partitioned into $M^N$ cells. (You can choose the segments in
each dimension to be equal or unequal, according to taste.) With 4 parameters and 3
cars, for example, you end up with 3 × 3 × 3 × 3 = 81 cells.

Next, choose M cells to contain the sample points by the following algorithm: 
Randomly choose one of the $M^N$ cells for the first point. Now eliminate all cells
that agree with this point on any of its parameters (that is, cross out all cells in the
same row, column, etc.), leaving $(M-1)^N$ candidates. Randomly choose one of
these, eliminate new rows and columns, and continue the process until there is only 
one cell left, which then contains the final sample point. 

The result of this construction is that each design parameter will have been 
tested in every one of its subranges. If the response of the system under test is 
dominated by one of the design parameters, that parameter will be found with this 
sampling technique. On the other hand, if there is an important interaction among 
different design parameters, then the Latin hypercube gives no particular advantage. 
Use with care. 
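
One common way to carry out the cell selection without literally crossing out cells is to draw an independent random permutation of the M segments for each dimension; the set of chosen cells then has the same defining property (no two samples share a segment along any axis). The sketch below takes that route. The function names and the flat cell[] layout are our own assumptions, and rand() is used only for brevity.

```c
#include <assert.h>
#include <stdlib.h>

/* Fisher-Yates shuffle of perm[0..m-1] using rand(). */
static void shuffle(int perm[], int m)
{
    int i;
    for (i = m - 1; i > 0; i--) {
        int j = rand() % (i + 1);    /* slight modulo bias, acceptable for a sketch */
        int t = perm[i];
        perm[i] = perm[j];
        perm[j] = t;
    }
}

/* Choose m cells in an n-dimensional grid with m segments per axis.
   cell[p*n + d] receives the segment index (0..m-1) of sample p along axis d.
   Each axis gets its own random permutation, so no two samples share
   a segment on any axis. */
void latin_hypercube(int m, int n, int cell[])
{
    int p, d;
    int *perm;

    assert(m > 0 && n > 0);
    perm = malloc((size_t)m * sizeof *perm);
    assert(perm != NULL);
    for (d = 0; d < n; d++) {
        for (p = 0; p < m; p++) perm[p] = p;
        shuffle(perm, m);
        for (p = 0; p < m; p++) cell[p * n + d] = perm[p];
    }
    free(perm);
}
```

For the crash-test example one would call latin_hypercube(3, 4, cell) with a 12-element array; each of the 4 design parameters then appears once in each of its 3 subranges.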
7.8 Adaptive and Recursive Monte Carlo Methods


This section discusses more advanced techniques of Monte Carlo integration. As 
examples of the use of these techniques, we include two rather different, fairly sophisticated, 
multidimensional Monte Carlo codes: vegas [1,2], and miser [3]. The techniques that we
discuss all fall under the general rubric of reduction of variance (§7.6), but are otherwise 
quite distinct. 

Importance Sampling 


The use of importance sampling was already implicit in equations (7.6.6) and (7.6.7). 
We now return to it in a slightly more formal way. Suppose that an integrand f can be written
as the product of a function h that is almost constant times another, positive, function g. Then
its integral over a multidimensional volume V is

$$\int f\,dV = \int (f/g)\,g\,dV = \int h\,g\,dV \tag{7.8.1}$$

In equation (7.6.7) we interpreted equation (7.8.1) as suggesting a change of variable to 
G, the indefinite integral of g. That made gdV a perfect differential. We then proceeded 
to use the basic theorem of Monte Carlo integration, equation (7.6.1). A more general 
interpretation of equation (7.8.1) is that we can integrate f by instead sampling h (not,
however, with uniform probability density dV, but rather with nonuniform density g dV). In
this second interpretation, the first interpretation follows as the special case, where the means 
of generating the nonuniform sampling of gdV is via the transformation method, using the 
indefinite integral G (see §7.2). 

More directly, one can go back and generalize the basic theorem (7.6.1) to the case 
of nonuniform sampling: Suppose that points $x_i$ are chosen within the volume V with a
probability density p satisfying

$$\int p\,dV = 1 \tag{7.8.2}$$


The generalized fundamental theorem is that the integral of any function f is estimated, using
N sample points $x_1, \ldots, x_N$, by

$$\int f\,dV = \int \frac{f}{p}\,p\,dV \approx \left\langle \frac{f}{p} \right\rangle \pm \sqrt{\frac{\langle f^2/p^2 \rangle - \langle f/p \rangle^2}{N}} \tag{7.8.3}$$

where angle brackets denote arithmetic means over the N points, exactly as in equation
(7.6.2). As in equation (7.6.1), the “plus-or-minus” term is a one standard deviation error
estimate. Notice that equation (7.6.1) is in fact the special case of equation (7.8.3), with
p = constant = 1/V.
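
A one-dimensional illustration of equation (7.8.3) may help. In the sketch below the integrand f(x) = 3x^2 on [0,1] (exact integral 1), the density p(x) = 2x, and all function names are our own choices for the example; p is sampled as the larger of two uniform deviates, and rand() stands in for a better generator.

```c
#include <assert.h>
#include <stdlib.h>

/* Uniform deviate strictly inside (0,1), built on rand(). */
static double uniform01(void)
{
    return ((double)rand() + 0.5) / ((double)RAND_MAX + 1.0);
}

static double f(double x) { return 3.0 * x * x; }   /* integrand; integral over [0,1] is 1 */
static double p(double x) { return 2.0 * x; }       /* sampling density; integrates to 1 */

/* Draw x with density p(x) = 2x: the larger of two uniform deviates. */
static double sample_p(void)
{
    double u1 = uniform01(), u2 = uniform01();
    return (u1 > u2) ? u1 : u2;
}

/* Importance-sampled estimate of the integral of f over [0,1], equation (7.8.3).
   On return *err2 holds the square of the one standard deviation error
   estimate, (<f^2/p^2> - <f/p>^2) / N. */
double importance_estimate(long n, double *err2)
{
    double sum = 0.0, sum2 = 0.0, mean;
    long i;

    assert(n > 1 && err2 != NULL);
    for (i = 0; i < n; i++) {
        double x = sample_p();
        double h = f(x) / p(x);     /* the sampled function h = f/p */
        sum += h;
        sum2 += h * h;
    }
    mean = sum / n;
    *err2 = (sum2 / n - mean * mean) / n;
    return mean;
}
```

Here h = f/p = 1.5x has variance 1/8 under p, compared with a variance of 4/5 for f under uniform sampling, so the same N buys roughly a factor of 2.5 in accuracy.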

What is the best choice for the sampling density p? Intuitively, we have already seen
that the idea is to make h = f/p as close to constant as possible. We can be more rigorous
by focusing on the numerator inside the square root in equation (7.8.3), which is the variance
per sample point. Both angle brackets are themselves Monte Carlo estimators of integrals,
so we can write

$$S \equiv \left\langle \frac{f^2}{p^2} \right\rangle - \left\langle \frac{f}{p} \right\rangle^2 \approx \int \frac{f^2}{p^2}\,p\,dV - \left[ \int \frac{f}{p}\,p\,dV \right]^2 = \int \frac{f^2}{p}\,dV - \left[ \int f\,dV \right]^2 \tag{7.8.4}$$

We now find the optimal p subject to the constraint equation (7.8.2) by the functional variation 

$$0 = \frac{\delta}{\delta p} \left( \int \frac{f^2}{p}\,dV - \left[ \int f\,dV \right]^2 + \lambda \int p\,dV \right) \tag{7.8.5}$$

imple page from NUMERICAL RECIPES IN C: THE ART OF SCIENTIFIC COMPUTING (ISBN 0-521 -43108-5) 






with $\lambda$ a Lagrange multiplier. Note that the middle term does not depend on p. The variation
(which comes inside the integrals) gives $0 = -f^2/p^2 + \lambda$ or

$$p = \frac{|f|}{\sqrt{\lambda}} = \frac{|f|}{\int |f|\,dV} \tag{7.8.6}$$

where $\lambda$ has been chosen to enforce the constraint (7.8.2).

If f has one sign in the region of integration, then we get the obvious result that the
optimal choice of p (if one can figure out a practical way of effecting the sampling) is
that it be proportional to |f|. Then the variance is reduced to zero. Not so obvious, but seen
to be true, is the fact that $p \propto |f|$ is optimal even if f takes on both signs. In that case the
variance per sample point (from equations 7.8.4 and 7.8.6) is

$$S = S_{\text{optimal}} = \left( \int |f|\,dV \right)^2 - \left( \int f\,dV \right)^2 \tag{7.8.7}$$

One curiosity is that one can add a constant to the integrand to make it all of one sign, 
since this changes the integral by a known amount, constant × V. Then, the optimal choice
of p always gives zero variance, that is, a perfectly accurate integral! The resolution of
this seeming paradox (already mentioned at the end of §7.6) is that perfect knowledge of p
in equation (7.8.6) requires perfect knowledge of $\int |f|\,dV$, which is tantamount to already
knowing the integral you are trying to compute! 

If your function f takes on a known constant value in most of the volume V, it is
certainly a good idea to add a constant so as to make that value zero. Having done that, the 
accuracy attainable by importance sampling depends in practice not on how small equation 
(7.8.7) is, but rather on how small is equation (7.8.4) for an implementable p, likely only a 
crude approximation to the ideal. 


Stratified Sampling 


The idea of stratified sampling is quite different from importance sampling. Let us
expand our notation slightly and let $\langle\langle f \rangle\rangle$ denote the true average of the function f over
the volume V (namely the integral divided by V), while $\langle f \rangle$ denotes as before the simplest
(uniformly sampled) Monte Carlo estimator of that average:

$$\langle\langle f \rangle\rangle \equiv \frac{1}{V} \int f\,dV \qquad \langle f \rangle \equiv \frac{1}{N} \sum_i f(x_i) \tag{7.8.8}$$

The variance of the estimator, $\mathrm{Var}(\langle f \rangle)$, which measures the square of the error of the
Monte Carlo integration, is asymptotically related to the variance of the function,
$\mathrm{Var}(f) \equiv \langle\langle f^2 \rangle\rangle - \langle\langle f \rangle\rangle^2$, by the relation

$$\mathrm{Var}(\langle f \rangle) = \frac{\mathrm{Var}(f)}{N} \tag{7.8.9}$$

(compare equation 7.6.1).

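Suppose we divide the volume V into two equal, disjoint subvolumes, denoted a and b,
and sample N/2 points in each subvolume. Then another estimator for $\langle\langle f \rangle\rangle$, different from
equation (7.8.8), which we denote $\langle f \rangle'$, is

$$\langle f \rangle' \equiv \frac{1}{2} \left( \langle f \rangle_a + \langle f \rangle_b \right) \tag{7.8.10}$$

in other words, the mean of the sample averages in the two half-regions. The variance of
estimator (7.8.10) is given by

$$\begin{aligned}
\mathrm{Var}(\langle f \rangle') &= \frac{1}{4} \left[ \mathrm{Var}(\langle f \rangle_a) + \mathrm{Var}(\langle f \rangle_b) \right] \\
&= \frac{1}{4} \left[ \frac{\mathrm{Var}_a(f)}{N/2} + \frac{\mathrm{Var}_b(f)}{N/2} \right] \\
&= \frac{1}{2N} \left[ \mathrm{Var}_a(f) + \mathrm{Var}_b(f) \right]
\end{aligned} \tag{7.8.11}$$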
Here $\mathrm{Var}_a(f)$ denotes the variance of f in subregion a, that is, $\langle\langle f^2 \rangle\rangle_a - \langle\langle f \rangle\rangle_a^2$, and
correspondingly for b.

From the definitions already given, it is not difficult to prove the relation 

$$\mathrm{Var}(f) = \frac{1}{2} \left[ \mathrm{Var}_a(f) + \mathrm{Var}_b(f) \right] + \frac{1}{4} \left( \langle\langle f \rangle\rangle_a - \langle\langle f \rangle\rangle_b \right)^2 \tag{7.8.12}$$

(In physics, this formula for combining second moments is the “parallel axis theorem.”)
Comparing equations (7.8.9), (7.8.11), and (7.8.12), one sees that the stratified (into two
subvolumes) sampling gives a variance that is never larger than the simple Monte Carlo case,
and smaller whenever the means of the stratified samples, $\langle\langle f \rangle\rangle_a$ and $\langle\langle f \rangle\rangle_b$, are different.
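
A quick worked check of (7.8.12), using an example of our own choosing: take f(x) = x on the unit interval, with a = [0, 1/2] and b = [1/2, 1].

```latex
\begin{aligned}
\mathrm{Var}(f) &= \int_0^1 x^2\,dx - \Big(\int_0^1 x\,dx\Big)^2 = \tfrac13 - \tfrac14 = \tfrac1{12},\\
\langle\langle f\rangle\rangle_a &= \tfrac14,\qquad
\langle\langle f\rangle\rangle_b = \tfrac34,\qquad
\mathrm{Var}_a(f) = \mathrm{Var}_b(f) = \tfrac1{48},\\
\tfrac12\Big[\tfrac1{48}+\tfrac1{48}\Big] + \tfrac14\Big(\tfrac14-\tfrac34\Big)^2
 &= \tfrac1{48} + \tfrac1{16} = \tfrac1{12}.
\end{aligned}
```

For this f, equation (7.8.11) gives a stratified variance of 1/(48N), against 1/(12N) from equation (7.8.9): a factor of 4 from a single cut.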

We have not yet exploited the possibility of sampling the two subvolumes with different 
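numbers of points, say $N_a$ in subregion a and $N_b \equiv N - N_a$ in subregion b. Let us do so
now. Then the variance of the estimator is

$$\mathrm{Var}(\langle f \rangle') = \frac{1}{4} \left[ \frac{\mathrm{Var}_a(f)}{N_a} + \frac{\mathrm{Var}_b(f)}{N - N_a} \right] \tag{7.8.13}$$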


which is minimized (one can easily verify) when

$$\frac{N_a}{N} = \frac{\sigma_a}{\sigma_a + \sigma_b} \tag{7.8.14}$$


Here we have adopted the shorthand notation $\sigma_a \equiv [\mathrm{Var}_a(f)]^{1/2}$, and correspondingly for b.
If $N_a$ satisfies equation (7.8.14), then equation (7.8.13) reduces to

$$\mathrm{Var}(\langle f \rangle') = \frac{(\sigma_a + \sigma_b)^2}{4N} \tag{7.8.15}$$

Equation (7.8.15) reduces to equation (7.8.9) if $\mathrm{Var}(f) = \mathrm{Var}_a(f) = \mathrm{Var}_b(f)$, in which case
stratifying the sample makes no difference. 
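
Equations (7.8.13) through (7.8.15) translate directly into a few lines of code. The following sketch (function names ours, point counts treated as real numbers rather than rounded to integers) lets one compare an arbitrary split with the optimal one.

```c
#include <assert.h>

/* Variance of the two-subvolume estimator, equation (7.8.13), given the
   standard deviations of f in each half and the split na + nb = n. */
double stratified_variance(double sigma_a, double sigma_b, double na, double n)
{
    assert(na > 0.0 && na < n);
    return 0.25 * (sigma_a * sigma_a / na + sigma_b * sigma_b / (n - na));
}

/* Optimal number of points in subregion a, equation (7.8.14). */
double optimal_na(double sigma_a, double sigma_b, double n)
{
    assert(sigma_a >= 0.0 && sigma_b >= 0.0 && sigma_a + sigma_b > 0.0);
    return n * sigma_a / (sigma_a + sigma_b);
}

/* Minimum variance reached at the optimal split, equation (7.8.15). */
double optimal_variance(double sigma_a, double sigma_b, double n)
{
    double s = sigma_a + sigma_b;
    return s * s / (4.0 * n);
}
```

With sigma_a = 1, sigma_b = 3, and N = 1000, the optimal split puts 250 points in a and gives a variance of 0.004, whereas the equal split of 500 and 500 gives 0.005.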

A standard way to generalize the above result is to consider the volume V divided into 
more than two equal subregions. One can readily obtain the result that the optimal allocation of 
sample points among the regions is to have the number of points in each region j proportional
to $\sigma_j$ (that is, the square root of the variance of the function f in that subregion). In spaces
of high dimensionality (say $d \gtrsim 4$) this is not in practice very useful, however. Dividing a
volume into K segments along each dimension implies $K^d$ subvolumes, typically much too
large a number when one contemplates estimating all the corresponding $\sigma_j$’s.


Mixed Strategies 


Importance sampling and stratified sampling seem, at first sight, inconsistent with each 
other. The former concentrates sample points where the magnitude of the integrand |f| is
largest, the latter where the variance of f is largest. How can both be right?

The answer is that (like so much else in life) it all depends on what you know and how 
well you know it. Importance sampling depends on already knowing some approximation to 
your integral, so that you are able to generate random points $x_i$ with the desired probability
density p. To the extent that your p is not ideal, you are left with an error that decreases
only as $N^{-1/2}$. Things are particularly bad if your p is far from ideal in a region where the
integrand f is changing rapidly, since then the sampled function h = f/p will have a large
variance. Importance sampling works by smoothing the values of the sampled function h, and 
is effective only to the extent that you succeed in this. 

Stratified sampling, by contrast, does not necessarily require that you know anything
about f. Stratified sampling works by smoothing out the fluctuations of the number of points
in subregions, not by smoothing the values of the points. The simplest stratified strategy,
dividing V into N equal subregions and choosing one point randomly in each subregion,
already gives a method whose error decreases asymptotically as $N^{-1}$, much faster than
$N^{-1/2}$. (Note that quasi-random numbers, §7.7, are another way of smoothing fluctuations in
the density of points, giving nearly as good a result as the “blind” stratification strategy.) 
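
In one dimension the “blind” strategy takes only a few lines. The sketch below uses an integrand of our own choosing, f(x) = x^2 on [0,1] (exact integral 1/3), and rand() for brevity; the function name is hypothetical.

```c
#include <assert.h>
#include <stdlib.h>

/* Integrand for the demonstration; its integral over [0,1] is 1/3. */
static double f(double x) { return x * x; }

/* "Blind" stratified estimate of the integral of f over [0,1]:
   n equal subintervals, one uniformly random point in each. */
double stratified_estimate(long n)
{
    double sum = 0.0;
    long i;

    assert(n > 0);
    for (i = 0; i < n; i++) {
        double u = (double)rand() / ((double)RAND_MAX + 1.0);  /* in [0,1) */
        sum += f(((double)i + u) / (double)n);   /* random point in subinterval i */
    }
    return sum / (double)n;
}
```

Because this f is monotonic, the estimate is bracketed by the lower and upper Riemann sums on the same N subintervals, which differ by exactly 1/N. The error therefore cannot exceed 1/N whatever the random numbers turn out to be.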

However, “asymptotically” is an important caveat: For example, if the integrand is 
negligible in all but a single subregion, then the resulting one-sample integration is all but 
useless. Information, even very crude, allowing importance sampling to put many points in
the active subregion would be much better than blind stratified sampling. 

Stratified sampling really comes into its own if you have some way of estimating the 
variances, so that you can put unequal numbers of points in different subregions, according to 
(7.8.14) or its generalizations, and if you can find a way of dividing a region into a practical 
number of subregions (notably not $K^d$ with large dimension d), while yet significantly
reducing the variance of the function in each subregion compared to its variance in the full
volume. Doing this requires a lot of knowledge about f, though different knowledge from
what is required for importance sampling. 

In practice, importance sampling and stratified sampling are not incompatible. In many, 
if not most, cases of interest, the integrand f is small everywhere in V except for a small
fractional volume of “active regions.” In these regions the magnitude of |f| and the standard
deviation $\sigma = [\mathrm{Var}(f)]^{1/2}$ are comparable in size, so both techniques will give about the
same concentration of points. In more sophisticated implementations, it is also possible to 
“nest” the two techniques, so that (e.g.) importance sampling on a crude grid is followed 
by stratification within each grid cell. 


Adaptive Monte Carlo: VEGAS 


The VEGAS algorithm, invented by Peter Lepage [1,2], is widely used for multidimensional
integrals that occur in elementary particle physics. VEGAS is primarily based on
importance sampling, but it also does some stratified sampling if the dimension d is small
enough to avoid $K^d$ explosion (specifically, if $(K/2)^d < N/2$, with N the number of sample
points). The basic technique for importance sampling in VEGAS is to construct, adaptively,
a multidimensional weight function g that is separable,

$$p \propto g(x, y, z, \ldots) = g_x(x)\,g_y(y)\,g_z(z) \cdots \tag{7.8.16}$$

Such a function avoids the $K^d$ explosion in two ways: (i) It can be stored in the computer
as d separate one-dimensional functions, each defined by K tabulated values, say, so that
$K \times d$ replaces $K^d$. (ii) It can be sampled as a probability density by consecutively sampling
the d one-dimensional functions to obtain coordinate vector components (x,y,z,...). 

The optimal separable weight function can be shown to be [1]

$$g_x(x) \propto \left[ \int dy \int dz \cdots \frac{f^2(x, y, z, \ldots)}{g_y(y)\,g_z(z) \cdots} \right]^{1/2} \tag{7.8.17}$$


(and correspondingly for y, z, ...). Notice that this reduces to $g \propto |f|$ (7.8.6) in one
dimension. Equation (7.8.17) immediately suggests VEGAS’ adaptive strategy: Given a
set of g-functions (initially all constant, say), one samples the function f, accumulating not
only the overall estimator of the integral, but also the $K \times d$ estimators (K subdivisions of the
independent variable in each of d dimensions) of the right-hand side of equation (7.8.17). 
These then determine improved g functions for the next iteration. 

When the integrand f is concentrated in one, or at most a few, regions in d-space, then
the weight function g’s quickly become large at coordinate values that are the projections of 
these regions onto the coordinate axes. The accuracy of the Monte Carlo integration is then 
enormously enhanced over what simple Monte Carlo would give. 

The weakness of VEGAS is the obvious one: To the extent that the projection of the 
function f onto individual coordinate directions is uniform, VEGAS gives no concentration
of sample points in those dimensions. The worst case for VEGAS, e.g., is an integrand that 
is concentrated close to a body diagonal line, e.g., one from (0,0,0,...) to (1,1,1,...). 
Since this geometry is completely nonseparable, VEGAS can give no advantage at all. More 
generally, VEGAS may not do well when the integrand is concentrated in one-dimensional 
(or higher) curved trajectories (or hypersurfaces), unless these happen to be oriented close 
to the coordinate directions. 


The routine vegas that follows is essentially Lepage’s standard version, minimally 
modified to conform to our conventions. (We thank Lepage for permission to reproduce the 
program here.) For consistency with other versions of the VEGAS algorithm in circulation,
we have preserved original variable names. The parameter NDMX is what we have called K,
the maximum number of increments along each axis; MXDIM is the maximum value of d; some 
other parameters are explained in the comments. 

The vegas routine performs m = itmx statistically independent evaluations of the 
desired integral, each with N = ncall function evaluations. While statistically independent, 
these iterations do assist each other, since each one is used to refine the sampling grid for 
the next one. The results of all iterations are combined into a single best answer, and its 
estimated error, by the relations 


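$$
I_{\rm best} = \sum_{i=1}^{m} \frac{I_i}{\sigma_i^2} \Bigg/ \sum_{i=1}^{m} \frac{1}{\sigma_i^2},
\qquad
\sigma_{\rm best} = \left( \sum_{i=1}^{m} \frac{1}{\sigma_i^2} \right)^{-1/2}
\tag{7.8.18}
$$

Also returned is the quantity 

$$
\chi^2/m \equiv \frac{1}{m-1} \sum_{i=1}^{m} \frac{(I_i - I_{\rm best})^2}{\sigma_i^2}
\tag{7.8.19}
$$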


If this is significantly larger than 1, then the results of the iterations are statistically 
inconsistent, and the answers are suspect. 

The input flag init can be used to advantage. One might have a call with init=0, 
ncall=1000, itmx=5 immediately followed by a call with init=1, ncall=100000, itmx=1. 
The effect would be to develop a sampling grid over 5 iterations of a small number of samples, 
then to do a single high accuracy integration on the optimized grid. 

Note that the user-supplied integrand function, fxn, has an argument wgt in addition 
to the expected evaluation point x. In most applications you ignore wgt inside the function. 
Occasionally, however, you may want to integrate some additional function or functions along 
with the principal function f. The integral of any such function g can be estimated by 

$$
I_g = \sum_i w_i\, g(\mathbf{x})
\tag{7.8.20}
$$

where the $w_i$'s and x's are the arguments wgt and x, respectively. It is straightforward to 
accumulate this sum inside your function fxn, and to pass the answer back to your main 
program via global variables. Of course, g(x) had better resemble the principal function f to 
some degree, since the sampling will be optimized for f. 
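
A minimal sketch of such an integrand follows, assuming a one-dimensional principal function f(x) = x_1 and a secondary function g(x) = x_1^2. The names g_sum, g_extra, and fxn_with_extra are hypothetical and chosen only for this illustration.

```c
/* Hypothetical names throughout: g_sum, g_extra, and fxn_with_extra are
   illustrative and are not part of the vegas routine. Arrays are
   unit-offset, so x[1] is the first coordinate. */

double g_sum = 0.0;    /* running estimate of the integral of g */

/* Secondary function g(x) whose integral we also want. */
static float g_extra(const float x[])
{
    return x[1] * x[1];
}

/* Integrand in the form vegas expects: returns f(x) and, as a side effect,
   adds wgt*g(x) to the global accumulator. */
float fxn_with_extra(float x[], float wgt)
{
    g_sum += (double)wgt * g_extra(x);
    return x[1];    /* principal function f(x) = x_1 */
}
```

The main program would set g_sum to zero before the call and read it afterwards. Note that the accumulator as written keeps growing over every iteration of a call, so with itmx greater than 1 it would have to be reset or averaged per iteration; that bookkeeping is left out of this sketch.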


#include <stdio.h>
#include <math.h>
#include "nrutil.h"
#define ALPH 1.5
#define NDMX 50
#define MXDIM 10
#define TINY 1.0e-30

extern long idum;    /* For random number initialization in main. */

void vegas(float regn[], int ndim, float (*fxn)(float [], float), int init,
    unsigned long ncall, int itmx, int nprn, float *tgral, float *sd,
    float *chi2a)
/* Performs Monte Carlo integration of a user-supplied ndim-dimensional function fxn over a
   rectangular volume specified by regn[1..2*ndim], a vector consisting of ndim "lower left"
   coordinates of the region followed by ndim "upper right" coordinates. The integration consists
   of itmx iterations, each with approximately ncall calls to the function. After each iteration
   the grid is refined; more than 5 or 10 iterations are rarely useful. The input flag init signals
   whether this call is a new start, or a subsequent call for additional iterations (see comments
   below). The input flag nprn (normally 0) controls the amount of diagnostic output. Returned
   answers are tgral (the best estimate of the integral), sd (its standard deviation), and chi2a
   (chi-square per degree of freedom, an indicator of whether consistent results are being
   obtained). See text for further details. */
{
    float ran2(long *idum);
    void rebin(float rc, int nd, float r[], float xin[], float xi[]);
    static int i,it,j,k,mds,nd,ndo,ng,npg,ia[MXDIM+1],kg[MXDIM+1];
    static float calls,dv2g,dxg,f,f2,f2b,fb,rc,ti,tsi,wgt,xjac,xn,xnd,xo;
    static float d[NDMX+1][MXDIM+1],di[NDMX+1][MXDIM+1],dt[MXDIM+1],
        dx[MXDIM+1],r[NDMX+1],x[MXDIM+1],xi[MXDIM+1][NDMX+1],xin[NDMX+1];
    static double schi,si,swgt;
    /* Best make everything static, allowing restarts. */


    if (init <= 0) {    /* Normal entry. Enter here on a cold start. */
        mds=ndo=1;      /* Change to mds=0 to disable stratified sampling, */
        for (j=1;j<=ndim;j++) xi[j][1]=1.0;    /* i.e., use importance sampling only. */
    }
    if (init <= 1) si=swgt=schi=0.0;
    /* Enter here to inherit the grid from a previous call, but not its answers. */
    if (init <= 2) {    /* Enter here to inherit the previous grid and its answers. */
        nd=NDMX;
        ng=1;
        if (mds) {      /* Set up for stratification. */
            ng=(int)pow(ncall/2.0+0.25,1.0/ndim);
            mds=1;
            if ((2*ng-NDMX) >= 0) {
                mds = -1;
                npg=ng/NDMX+1;
                nd=ng/npg;
                ng=npg*nd;
            }
        }
        for (k=1,i=1;i<=ndim;i++) k *= ng;
        npg=IMAX(ncall/k,2);
        calls=(float)npg * (float)k;
        dxg=1.0/ng;
        for (dv2g=1,i=1;i<=ndim;i++) dv2g *= dxg;
        dv2g=SQR(calls*dv2g)/npg/npg/(npg-1.0);
        xnd=nd;
        dxg *= xnd;
        xjac=1.0/calls;
        for (j=1;j<=ndim;j++) {
            dx[j]=regn[j+ndim]-regn[j];
            xjac *= dx[j];
        }
        if (nd != ndo) {    /* Do binning if necessary. */
            for (i=1;i<=IMAX(nd,ndo);i++) r[i]=1.0;
            for (j=1;j<=ndim;j++) rebin(ndo/xnd,nd,r,xin,xi[j]);
            ndo=nd;
        }
        if (nprn >= 0) {
            printf("%s: ndim= %3d ncall= %8.0f\n",
                " Input parameters for vegas",ndim,calls);
            printf("%28s it=%5d itmx=%5d\n"," ",it,itmx);
            printf("%28s nprn=%3d ALPH=%5.2f\n"," ",nprn,ALPH);
            printf("%28s mds=%3d nd=%4d\n"," ",mds,nd);
            for (j=1;j<=ndim;j++) {
                printf("%30s xl[%2d]= %11.4g xu[%2d]= %11.4g\n",
                    " ",j,regn[j],j,regn[j+ndim]);
            }
        }
    }

    for (it=1;it<=itmx;it++) {
        /* Main iteration loop. Can enter here (init >= 3) to do an additional itmx
           iterations with all other parameters unchanged. */
        ti=tsi=0.0;
        for (j=1;j<=ndim;j++) {
            kg[j]=1;
            for (i=1;i<=nd;i++) d[i][j]=di[i][j]=0.0;
        }
        for (;;) {
            fb=f2b=0.0;
            for (k=1;k<=npg;k++) {
                wgt=xjac;
                for (j=1;j<=ndim;j++) {
                    xn=(kg[j]-ran2(&idum))*dxg+1.0;
                    ia[j]=IMAX(IMIN((int)(xn),NDMX),1);
                    if (ia[j] > 1) {
                        xo=xi[j][ia[j]]-xi[j][ia[j]-1];
                        rc=xi[j][ia[j]-1]+(xn-ia[j])*xo;
                    } else {
                        xo=xi[j][ia[j]];
                        rc=(xn-ia[j])*xo;
                    }
                    x[j]=regn[j]+rc*dx[j];
                    wgt *= xo*xnd;
                }
                f=wgt*(*fxn)(x,wgt);
                f2=f*f;
                fb += f;
                f2b += f2;
                for (j=1;j<=ndim;j++) {
                    di[ia[j]][j] += f;
                    if (mds >= 0) d[ia[j]][j] += f2;
                }
            }
            f2b=sqrt(f2b*npg);
            f2b=(f2b-fb)*(f2b+fb);
            if (f2b <= 0.0) f2b=TINY;
            ti += fb;
            tsi += f2b;
            if (mds < 0) {    /* Use stratified sampling. */
                for (j=1;j<=ndim;j++) d[ia[j]][j] += f2b;
            }
            for (k=ndim;k>=1;k--) {
                kg[k] %= ng;
                if (++kg[k] != 1) break;
            }
            if (k < 1) break;
        }

        tsi *= dv2g;    /* Compute final results for this iteration. */
        wgt=1.0/tsi;
        si += wgt*ti;
        schi += wgt*ti*ti;
        swgt += wgt;
        *tgral=si/swgt;
        *chi2a=(schi-si*(*tgral))/(it-0.9999);
        if (*chi2a < 0.0) *chi2a = 0.0;
        *sd=sqrt(1.0/swgt);
        tsi=sqrt(tsi);
        if (nprn >= 0) {
            printf("%s %3d : integral = %14.7g +/- %9.2g\n",
                " iteration no.",it,ti,tsi);
            printf("%s integral =%14.7g+/-%9.2g chi**2/IT n = %9.2g\n",
                " all iterations: ",*tgral,*sd,*chi2a);
            if (nprn) {
                for (j=1;j<=ndim;j++) {
                    printf(" DATA FOR axis %2d\n",j);
                    printf("%6s%13s%11s%13s%11s%13s\n",
                        "X","delta i","X","delta i","X","delta i");
                    for (i=1+nprn/2;i<=nd;i += nprn+2) {
                        printf("%8.5f%12.4g%12.5f%12.4g%12.5f%12.4g\n",
                            xi[j][i],di[i][j],xi[j][i+1],
                            di[i+1][j],xi[j][i+2],di[i+2][j]);
                    }
                }
            }
        }
        /* Refine the grid. Consult references to understand the subtlety of this
           procedure. The refinement is damped, to avoid rapid, destabilizing changes,
           and also compressed in range by the exponent ALPH. */
        for (j=1;j<=ndim;j++) {
            xo=d[1][j];
            xn=d[2][j];
            d[1][j]=(xo+xn)/2.0;
            dt[j]=d[1][j];
            for (i=2;i<nd;i++) {
                rc=xo+xn;
                xo=xn;
                xn=d[i+1][j];
                d[i][j] = (rc+xn)/3.0;
                dt[j] += d[i][j];
            }
            d[nd][j]=(xo+xn)/2.0;
            dt[j] += d[nd][j];
        }
        for (j=1;j<=ndim;j++) {
            rc=0.0;
            for (i=1;i<=nd;i++) {
                if (d[i][j] < TINY) d[i][j]=TINY;
                r[i]=pow((1.0-d[i][j]/dt[j])/
                    (log(dt[j])-log(d[i][j])),ALPH);
                rc += r[i];
            }
            rebin(rc/xnd,nd,r,xin,xi[j]);
        }
    }
}

void rebin(float rc, int nd, float r[], float xin[], float xi[])
/* Utility routine used by vegas, to rebin a vector of densities xi into new bins defined by a
   vector r. */
{
    int i,k=0;
    float dr=0.0,xn=0.0,xo=0.0;

    for (i=1;i<nd;i++) {
        while (rc > dr)
            dr += r[++k];
        if (k > 1) xo=xi[k-1];
        xn=xi[k];
        dr -= rc;
        xin[i]=xn-(xn-xo)*dr/r[k];
    }
    for (i=1;i<nd;i++) xi[i]=xin[i];
    xi[nd]=1.0;
}


Recursive Stratified Sampling 



The problem with stratified sampling, we have seen, is that it may not avoid the $K^d$ 
explosion inherent in the obvious, Cartesian, tessellation of a d-dimensional volume. A 
technique called recursive stratified sampling [3] attempts to do this by successive bisections 
of a volume, not along all d dimensions, but rather along only one dimension at a time. 
The starting points are equations (7.8.10) and (7.8.13), applied to bisections of successively 
smaller subregions. 

Suppose that we have a quota of N evaluations of the function f, and want to evaluate 
$\langle f \rangle'$ in the rectangular parallelepiped region $R = (\mathbf{x}_a, \mathbf{x}_b)$. (We denote such a region by the 
two coordinate vectors of its diagonally opposite corners.) First, we allocate a fraction p of 
N towards exploring the variance of f in R: We sample pN function values uniformly in 
R and accumulate the sums that will give the d different pairs of variances corresponding to 
the d different coordinate directions along which R can be bisected. In other words, in pN 
samples, we estimate Var(f) in each of the regions resulting from a possible bisection of R, 

$$
\begin{aligned}
R_{ai} &\equiv \left( \mathbf{x}_a,\; \mathbf{x}_b - \tfrac{1}{2}\, \mathbf{e}_i \cdot (\mathbf{x}_b - \mathbf{x}_a)\, \mathbf{e}_i \right) \\
R_{bi} &\equiv \left( \mathbf{x}_a + \tfrac{1}{2}\, \mathbf{e}_i \cdot (\mathbf{x}_b - \mathbf{x}_a)\, \mathbf{e}_i,\; \mathbf{x}_b \right)
\end{aligned}
\tag{7.8.21}
$$

Here $\mathbf{e}_i$ is the unit vector in the ith coordinate direction, i = 1, 2, ..., d. 

Second, we inspect the variances to find the most favorable dimension i to bisect. By 
equation (7.8.15), we could, for example, choose that i for which the sum of the square roots 
of the variance estimators in regions $R_{ai}$ and $R_{bi}$ is minimized. (Actually, as we will explain, 
we do something slightly different.) 

Third, we allocate the remaining (1 − p)N function evaluations between the regions 
$R_{ai}$ and $R_{bi}$. If we used equation (7.8.15) to choose i, we should do this allocation according 
to equation (7.8.14). 

We now have two parallelepipeds each with its own allocation of function evaluations for 
estimating the mean of f. Our “RSS” algorithm now shows itself to be recursive: To evaluate 
the mean in each region, we go back to the sentence beginning “First,...” in the paragraph 
above equation (7.8.21). (Of course, when the allocation of points to a region falls below 
some number, we resort to simple Monte Carlo rather than continue with the recursion.) 

Finally, we combine the means, and also estimated variances of the two subvolumes, 
using equation (7.8.10) and the first line of equation (7.8.11). 

This completes the RSS algorithm in its simplest form. Before we describe some 
additional tricks under the general rubric of “implementation details,” we need to return briefly 
to equations (7.8.13)—(7.8.15) and derive the equations that we actually use instead of these. 
The right-hand side of equation (7.8.13) applies the familiar scaling law of equation (7.8.9) 
twice, once to a and again to b. This would be correct if the estimates $\langle f \rangle_a$ and $\langle f \rangle_b$ were 
each made by simple Monte Carlo, with uniformly random sample points. However, the two 
estimates of the mean are in fact made recursively. Thus, there is no reason to expect equation 
(7.8.9) to hold. Rather, we might substitute for equation (7.8.13) the relation 

$$
\mathrm{Var}\left( \langle f \rangle' \right) = \frac{1}{4} \left[ \frac{\mathrm{Var}_a(f)}{N_a^{\alpha}} + \frac{\mathrm{Var}_b(f)}{(N - N_a)^{\alpha}} \right]
\tag{7.8.22}
$$

where $\alpha$ is an unknown constant ≥ 1 (the case of equality corresponding to simple Monte 
Carlo). In that case, a short calculation shows that $\mathrm{Var}(\langle f \rangle')$ is minimized when 

$$
\frac{N_a}{N} = \frac{\mathrm{Var}_a(f)^{1/(1+\alpha)}}{\mathrm{Var}_a(f)^{1/(1+\alpha)} + \mathrm{Var}_b(f)^{1/(1+\alpha)}}
\tag{7.8.23}
$$

and that its minimum value is 

$$
\mathrm{Var}\left( \langle f \rangle' \right) \propto \left[ \mathrm{Var}_a(f)^{1/(1+\alpha)} + \mathrm{Var}_b(f)^{1/(1+\alpha)} \right]^{1+\alpha}
\tag{7.8.24}
$$

Equations (7.8.22)-(7.8.24) reduce to equations (7.8.13)-(7.8.15) when $\alpha = 1$. Numerical 
experiments to find a self-consistent value for $\alpha$ find that $\alpha \approx 2$. That is, when equation 
(7.8.23) with $\alpha = 2$ is used recursively to allocate sample opportunities, the observed variance 
of the RSS algorithm goes approximately as $N^{-2}$, while any other value of $\alpha$ in equation 
(7.8.23) gives a poorer fall-off. (The sensitivity to $\alpha$ is, however, not very great; it is not 
known whether $\alpha = 2$ is an analytically justifiable result, or only a useful heuristic.) 

The principal difference between miser’s implementation and the algorithm as described 
thus far lies in how the variances on the right-hand side of equation (7.8.23) are estimated. We 
find empirically that it is somewhat more robust to use the square of the difference of maximum 
and minimum sampled function values, instead of the genuine second moment of the samples. 
This estimator is of course increasingly biased with increasing sample size; however, equation 
(7.8.23) uses it only to compare two subvolumes (a and b) having approximately equal numbers 
of samples. The “max minus min” estimator proves its worth when the preliminary sampling 
yields only a single point, or small number of points, in active regions of the integrand. In 
many realistic cases, these are indicators of nearby regions of even greater importance, and it 
is useful to let them attract the greater sampling weight that “max minus min” provides. 

A second modification embodied in the code is the introduction of a “dithering parameter,” 
dith, whose nonzero value causes subvolumes to be divided not exactly down the middle, but 
rather into fractions 0.5±dith, with the sign of the ± randomly chosen by a built-in random 
number routine. Normally dith can be set to zero. However, there is a large advantage in 
taking dith to be nonzero if some special symmetry of the integrand puts the active region 
exactly at the midpoint of the region, or at the center of some power-of-two submultiple of 
the region. One wants to avoid the extreme case of the active region being evenly divided 
into $2^d$ abutting corners of a d-dimensional space. A typical nonzero value of dith, on 
those occasions when it is useful, might be 0.1. Of course, when the dithering parameter 
is nonzero, we must take the differing sizes of the subvolumes into account; the code does 
this through the variable fracl. 

One final feature in the code deserves mention. The RSS algorithm uses a single set 
of sample points to evaluate equation (7.8.23) in all d directions. At bottom levels of the 
recursion, the number of sample points can be quite small. Although rare, it can happen that 
in one direction all the samples are in one half of the volume; in that case, that direction 
is ignored as a candidate for bifurcation. Even more rare is the possibility that all of the 
samples are in one half of the volume in all directions. In this case, a random direction is 
chosen. If this happens too often in your application, then you should increase MNPT (see 
line if (!jb)... in the code). 

Note that miser, as given, returns as ave an estimate of the average function value 
$\langle\langle f \rangle\rangle$, not the integral of f over the region. The routine vegas, adopting the other convention, 
returns as tgral the integral. The two conventions are of course trivially related, by equation 
(7.8.8), since the volume V of the rectangular region is known. 
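
Converting between the two conventions takes only the volume of the region. The following sketch assumes the same unit-offset regn[1..2*ndim] layout used by vegas and miser; the helper names region_volume and mean_to_integral are hypothetical.

```c
/* Volume of the rectangular region regn[1..2*ndim] (unit-offset): the ndim
   lower coordinates come first, followed by the ndim upper coordinates.
   (Hypothetical helper, not part of miser or vegas.) */
double region_volume(const float regn[], int ndim)
{
    double v = 1.0;
    int j;

    for (j = 1; j <= ndim; j++)
        v *= (double)regn[ndim + j] - (double)regn[j];
    return v;
}

/* Turn a mean function value and its variance (as returned by a routine
   using miser's convention) into an integral and the variance of that
   integral: the mean scales by V, the variance by V squared. */
void mean_to_integral(double ave, double var, double vol,
                      double *integral, double *int_var)
{
    *integral = vol * ave;
    *int_var = vol * vol * var;
}
```

For the region with lower corner (0, 1) and upper corner (2, 4) the volume is 6, so a mean of 0.5 with variance 0.01 corresponds to an integral of 3.0 with variance 0.36.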


#include <stdlib.h>
#include <math.h>
#include "nrutil.h"
#define PFAC 0.1
#define MNPT 15
#define MNBS 60
#define TINY 1.0e-30
#define BIG 1.0e30
/* Here PFAC is the fraction of remaining function evaluations used at each stage to explore the
   variance of func. At least MNPT function evaluations are performed in any terminal subregion;
   a subregion is further bisected only if at least MNBS function evaluations are available. We
   take MNBS = 4*MNPT. */

static long iran=0;

void miser(float (*func)(float []), float regn[], int ndim, unsigned long npts,
    float dith, float *ave, float *var)
/* Monte Carlo samples a user-supplied ndim-dimensional function func in a rectangular volume
   specified by regn[1..2*ndim], a vector consisting of ndim "lower-left" coordinates of the
   region followed by ndim "upper-right" coordinates. The function is sampled a total of npts
   times, at locations determined by the method of recursive stratified sampling. The mean value
   of the function in the region is returned as ave; an estimate of the statistical uncertainty
   of ave (square of standard deviation) is returned as var. The input parameter dith should
   normally be set to zero, but can be set to (e.g.) 0.1 if func's active region falls on the
   boundary of a power-of-two subdivision of region. */
{
    void ranpt(float pt[], float regn[], int n);
    float *regn_temp;
    unsigned long n,npre,nptl,nptr;
    int j,jb;
    float avel,varl;
    float fracl,fval;
    float rgl,rgm,rgr,s,sigl,siglb,sigr,sigrb;
    float sum,sumb,summ,summ2;
    float *fmaxl,*fmaxr,*fminl,*fminr;
    float *pt,*rmid;

    pt=vector(1,ndim);
    if (npts < MNBS) {    /* Too few points to bisect; do straight Monte Carlo. */
        summ=summ2=0.0;
        for (n=1;n<=npts;n++) {
            ranpt(pt,regn,ndim);
            fval=(*func)(pt);
            summ += fval;
            summ2 += fval * fval;
        }
        *ave=summ/npts;
        *var=FMAX(TINY,(summ2-summ*summ/npts)/(npts*npts));
    }

    else {    /* Do the preliminary (uniform) sampling. */
        rmid=vector(1,ndim);
        npre=LMAX((unsigned long)(npts*PFAC),MNPT);
        fmaxl=vector(1,ndim);
        fmaxr=vector(1,ndim);
        fminl=vector(1,ndim);
        fminr=vector(1,ndim);
        for (j=1;j<=ndim;j++) {    /* Initialize the left and right bounds for each dimension. */
            iran=(iran*2661+36979) % 175000;
            s=SIGN(dith,(float)(iran-87500));
            rmid[j]=(0.5+s)*regn[j]+(0.5-s)*regn[ndim+j];
            fminl[j]=fminr[j]=BIG;
            fmaxl[j]=fmaxr[j] = -BIG;
        }
        for (n=1;n<=npre;n++) {    /* Loop over the points in the sample. */
            ranpt(pt,regn,ndim);
            fval=(*func)(pt);
            for (j=1;j<=ndim;j++) {    /* Find the left and right bounds for each dimension. */
                if (pt[j]<=rmid[j]) {
                    fminl[j]=FMIN(fminl[j],fval);
                    fmaxl[j]=FMAX(fmaxl[j],fval);
                }
                else {
                    fminr[j]=FMIN(fminr[j],fval);
                    fmaxr[j]=FMAX(fmaxr[j],fval);
                }
            }
        }

        sumb=BIG;    /* Choose which dimension jb to bisect. */
        jb=0;
        siglb=sigrb=1.0;
        for (j=1;j<=ndim;j++) {
            if (fmaxl[j] > fminl[j] && fmaxr[j] > fminr[j]) {
                sigl=FMAX(TINY,pow(fmaxl[j]-fminl[j],2.0/3.0));
                sigr=FMAX(TINY,pow(fmaxr[j]-fminr[j],2.0/3.0));
                sum=sigl+sigr;    /* Equation (7.8.24), see text. */
                if (sum<=sumb) {
                    sumb=sum;
                    jb=j;
                    siglb=sigl;
                    sigrb=sigr;
                }
            }
        }
        free_vector(fminr,1,ndim);
        free_vector(fminl,1,ndim);
        free_vector(fmaxr,1,ndim);
        free_vector(fmaxl,1,ndim);

        if (!jb) jb=1+(ndim*iran)/175000;    /* MNPT may be too small. */
        rgl=regn[jb];    /* Apportion the remaining points between left and right. */
        rgm=rmid[jb];
        rgr=regn[ndim+jb];
        fracl=fabs((rgm-rgl)/(rgr-rgl));
        nptl=(unsigned long)(MNPT+(npts-npre-2*MNPT)*fracl*siglb
            /(fracl*siglb+(1.0-fracl)*sigrb));    /* Equation (7.8.23). */
        nptr=npts-npre-nptl;
        regn_temp=vector(1,2*ndim);    /* Now allocate and integrate the two subregions. */
        for (j=1;j<=ndim;j++) {
            regn_temp[j]=regn[j];
            regn_temp[ndim+j]=regn[ndim+j];
        }
        regn_temp[ndim+jb]=rmid[jb];
        miser(func,regn_temp,ndim,nptl,dith,&avel,&varl);
        regn_temp[jb]=rmid[jb];    /* Dispatch recursive call; will return back here eventually. */
        regn_temp[ndim+jb]=regn[ndim+jb];
        miser(func,regn_temp,ndim,nptr,dith,ave,var);
        free_vector(regn_temp,1,2*ndim);
        *ave=fracl*avel+(1-fracl)*(*ave);
        *var=fracl*fracl*varl+(1-fracl)*(1-fracl)*(*var);
        /* Combine left and right regions by equation (7.8.11) (1st line). */
        free_vector(rmid,1,ndim);
    }
    free_vector(pt,1,ndim);
}


The miser routine calls a short function ranpt to get a random point within a specified 
d-dimensional region. The following version of ranpt makes consecutive calls to a uniform 
random number generator and does the obvious scaling. One can easily modify ranpt to 
generate its points via the quasi-random routine sobseq (§7.7). We find that miser with 
sobseq can be considerably more accurate than miser with uniform random deviates. Since 
the use of RSS and the use of quasi-random numbers are completely separable, however, we 
have not made the code given here dependent on sobseq. A similar remark might be made 
regarding importance sampling, which could in principle be combined with RSS. (One could 
in principle combine vegas and miser, although the programming would be intricate.) 

extern long idum;

void ranpt(float pt[], float regn[], int n)
/* Returns a uniformly random point pt in an n-dimensional rectangular region. Used by miser;
   calls ran1 for uniform deviates. Your main program should initialize the global variable idum
   to a negative seed integer. */
{
    float ran1(long *idum);
    int j;

    for (j=1;j<=n;j++)
        pt[j]=regn[j]+(regn[n+j]-regn[j])*ran1(&idum);
}

8.0 Introduction 

This chapter almost doesn’t belong in a book on numerical methods. However, 
some practical knowledge of techniques for sorting is an indispensable part of any 
good programmer’s expertise. We would not want you to consider yourself expert in 
numerical techniques while remaining ignorant of so basic a subject. 

In conjunction with numerical work, sorting is frequently necessary when data 
(either experimental or numerically generated) are being handled. One has tables 
or lists of numbers, representing one or more independent (or “control”) variables, 
and one or more dependent (or “measured”) variables. One may wish to arrange 
these data, in various circumstances, in order by one or another of these variables. 
Alternatively, one may simply wish to identify the “median” value, or the “upper 
quartile” value of one of the lists of values. This task, closely related to sorting, 
is called selection. 

Here, more specifically, are the tasks that this chapter will deal with: 

• Sort, i.e., rearrange, an array of numbers into numerical order. 

• Rearrange an array into numerical order while performing the corre¬ 
sponding rearrangement of one or more additional arrays, so that the 
correspondence between elements in all arrays is maintained. 

• Given an array, prepare an index table for it, i.e., a table of pointers telling 
which number array element comes first in numerical order, which second, 
and so on. 

• Given an array, prepare a rank table for it, i.e., a table telling what is 
the numerical rank of the first array element, the second array element, 
and so on. 

• Select the Mth largest element from an array. 
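
To make the distinction between an index table and a rank table concrete, here is a minimal sketch. It is not one of this chapter's routines: it leans on the standard library's qsort, uses zero-based arrays, and the names make_index, make_rank, and sort_key are assumptions introduced for the illustration.

```c
#include <stdlib.h>

/* Array currently being indexed; set by make_index before calling qsort.
   (File-scope state keeps the sketch short but makes it non-reentrant.) */
static const float *sort_key;

/* Compare two index values by the array elements they point at. */
static int cmp_index(const void *p, const void *q)
{
    float a = sort_key[*(const size_t *)p];
    float b = sort_key[*(const size_t *)q];
    return (a > b) - (a < b);
}

/* Index table: fill indx[0..n-1] so that
   arr[indx[0]] <= arr[indx[1]] <= ... <= arr[indx[n-1]]. */
void make_index(const float arr[], size_t n, size_t indx[])
{
    size_t i;

    for (i = 0; i < n; i++) indx[i] = i;
    sort_key = arr;
    qsort(indx, n, sizeof indx[0], cmp_index);
}

/* Rank table: fill rank[0..n-1] so that rank[i] is the position that
   arr[i] would occupy in sorted order. It is the inverse permutation
   of the index table. */
void make_rank(const size_t indx[], size_t n, size_t rank[])
{
    size_t i;

    for (i = 0; i < n; i++) rank[indx[i]] = i;
}
```

For the array {3.0, 1.0, 2.0} the index table is {1, 2, 0} (element 1 comes first, then element 2, then element 0), while the rank table is {2, 0, 1} (element 0 is largest, element 1 smallest).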

For the basic task of sorting N elements, the best algorithms require on the 
order of several times $N \log_2 N$ operations. The algorithm inventor tries to reduce 
the constant in front of this estimate to as small a value as possible. Two of the 
best algorithms are Quicksort (§8.2), invented by the inimitable C.A.R. Hoare, and 
Heapsort (§8.3), invented by J.W.J. Williams. 

For large N (say > 1000), Quicksort is faster, on most machines, by a factor of 
1.5 or 2; it requires a bit of extra memory, however, and is a moderately complicated 
program. Heapsort is a true “sort in place,” and is somewhat more compact to 
program and therefore a bit easier to modify for special purposes. On balance, we 
recommend Quicksort because of its speed, but we implement both routines. 

For small N one does better to use an algorithm whose operation count goes 
as a higher, i.e., poorer, power of N, if the constant in front is small enough. For 
N < 20, roughly, the method of straight insertion (§8.1) is concise and fast enough. 
We include it with some trepidation: It is an $N^2$ algorithm, whose potential for 
misuse (by using it for too large an N) is great. The resultant waste of computer 
time is so awesome, that we were tempted not to include any $N^2$ routine at all. We 
will draw the line, however, at the inefficient $N^2$ algorithm, beloved of elementary 
computer science texts, called bubble sort. If you know what bubble sort is, wipe it 
from your mind; if you don’t know, make a point of never finding out! 

For N < 50, roughly, Shell’s method (§8.1), only slightly more complicated to 
program than straight insertion, is competitive with the more complicated Quicksort 
on many machines. This method goes as $N^{3/2}$ in the worst case, but is usually faster. 

See references [1,2] for further information on the subject of sorting, and for 
detailed references to the literature. 

CITED REFERENCES AND FURTHER READING: 

Knuth, D.E. 1973, Sorting and Searching, vol. 3 of The Art of Computer Programming (Reading, 
MA: Addison-Wesley). [1] 

Sedgewick, R. 1988, Algorithms, 2nd ed. (Reading, MA: Addison-Wesley), Chapters 8-13. [2] 


8.1 Straight Insertion and Shell’s Method 


Straight insertion is an $N^2$ routine, and should be used only for small N, 
say < 20. 

The technique is exactly the one used by experienced card players to sort their 
cards: Pick out the second card and put it in order with respect to the first; then pick 
out the third card and insert it into the sequence among the first two; and so on until 
the last card has been picked out and inserted. 

void piksrt(int n, float arr[])
/* Sorts an array arr[1..n] into ascending numerical order, by straight insertion. n is input;
   arr is replaced on output by its sorted rearrangement. */
{
    int i,j;
    float a;

    for (j=2;j<=n;j++) {    /* Pick out each element in turn. */
        a=arr[j];
        i=j-1;
        while (i > 0 && arr[i] > a) {    /* Look for the place to insert it. */
            arr[i+1]=arr[i];
            i--;
        }
        arr[i+1]=a;    /* Insert it. */
    }
}



What if you also want to rearrange an array brr at the same time as you sort 
arr? Simply move an element of brr whenever you move an element of arr: 

void piksr2(int n, float arr[], float brr[])
/* Sorts an array arr[1..n] into ascending numerical order, by straight insertion, while making
   the corresponding rearrangement of the array brr[1..n]. */
{
    int i,j;
    float a,b;

    for (j=2;j<=n;j++) {    /* Pick out each element in turn. */
        a=arr[j];
        b=brr[j];
        i=j-1;
        while (i > 0 && arr[i] > a) {    /* Look for the place to insert it. */
            arr[i+1]=arr[i];
            brr[i+1]=brr[i];
            i--;
        }
        arr[i+1]=a;    /* Insert it. */
        brr[i+1]=b;
    }
}


For the case of rearranging a larger number of arrays by sorting on one of 
them, see §8.4. 

Shell’s Method 

This is actually a variant on straight insertion, but a very powerful variant indeed. 
The rough idea, e.g., for the case of sorting 16 numbers $n_1 \ldots n_{16}$, is this: First sort, 
by straight insertion, each of the 8 groups of 2 $(n_1, n_9)$, $(n_2, n_{10})$, ..., $(n_8, n_{16})$. 
Next, sort each of the 4 groups of 4 $(n_1, n_5, n_9, n_{13})$, ..., $(n_4, n_8, n_{12}, n_{16})$. Next 
sort the 2 groups of 8 records, beginning with $(n_1, n_3, n_5, n_7, n_9, n_{11}, n_{13}, n_{15})$. 
Finally, sort the whole list of 16 numbers. 

Of course, only the last sort is necessary for putting the numbers into order. So 
what is the purpose of the previous partial sorts? The answer is that the previous 
sorts allow numbers efficiently to filter up or down to positions close to their final 
resting places. Therefore, the straight insertion passes on the final sort rarely have to 
go past more than a “few” elements before finding the right place. (Think of sorting 
a hand of cards that are already almost in order.) 

The spacings between the numbers sorted on each pass through the data (8,4,2,1 
in the above example) are called the increments , and a Shell sort is sometimes 
called a diminishing increment sort. There has been a lot of research into how to 
choose a good set of increments, but the optimum choice is not known. The set 
..., 8, 4, 2, 1 is in fact not a good choice, especially for N a power of 2. A much 
better choice is the sequence 

$$
(3^k - 1)/2, \ldots, 40, 13, 4, 1
\tag{8.1.1}
$$

which can be generated by the recurrence 

$$
i_1 = 1, \qquad i_{k+1} = 3 i_k + 1, \qquad k = 1, 2, \ldots
\tag{8.1.2}
$$

It can be shown (see [1]) that for this sequence of increments the number of operations 
required in all is of order $N^{3/2}$ for the worst possible ordering of the original data. 



For “randomly” ordered data, the operations count goes approximately as $N^{1.25}$, at 
least for N < 60000. For N > 50, however, Quicksort is generally faster. The 
program follows: 


void shell(unsigned long n, float a[])
/* Sorts an array a[] into ascending numerical order by Shell's method (diminishing increment
   sort). a is replaced on output by its sorted rearrangement. Normally, the argument n should
   be set to the size of array a, but if n is smaller than this, then only the first n elements
   of a are sorted. This feature is used in selip. */
{
    unsigned long i,j,inc;
    float v;

    inc=1;    /* Determine the starting increment. */
    do {
        inc *= 3;
        inc++;
    } while (inc <= n);
    do {    /* Loop over the partial sorts. */
        inc /= 3;
        for (i=inc+1;i<=n;i++) {    /* Outer loop of straight insertion. */
            v=a[i];
            j=i;
            while (a[j-inc] > v) {    /* Inner loop of straight insertion. */
                a[j]=a[j-inc];
                j -= inc;
                if (j <= inc) break;
            }
            a[j]=v;
        }
    } while (inc > 1);
}
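
The increment sequence of equation (8.1.1) can also be tabulated on its own, which is a convenient way to check which increments a given N will use. The helper below is a sketch; the name shell_increments, the zero-based output array, and the overflow guard are assumptions added here and are not part of the shell routine.

```c
#include <limits.h>
#include <stddef.h>

/* Store in inc[] the increments 1, 4, 13, 40, ... generated by the
   recurrence i_{k+1} = 3*i_k + 1 that do not exceed n, smallest first,
   writing at most max entries. Returns the number of entries stored.
   (Hypothetical helper, not part of the shell routine.) */
size_t shell_increments(unsigned long n, unsigned long inc[], size_t max)
{
    size_t count = 0;
    unsigned long h = 1;

    while (h <= n && count < max) {
        inc[count++] = h;
        if (h > (ULONG_MAX - 1) / 3) break;    /* next step would overflow */
        h = 3 * h + 1;
    }
    return count;
}
```

For n = 100 the function stores 1, 4, 13, 40; the shell routine would use the same values in the opposite order, starting from 40.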


CITED REFERENCES AND FURTHER READING: 

Knuth, D.E. 1973, Sorting and Searching, vol. 3 of The Art of Computer Programming (Reading, 
MA: Addison-Wesley), §5.2.1. [1] 

Sedgewick, R. 1988, Algorithms, 2nd ed. (Reading, MA: Addison-Wesley), Chapter 8. 


8.2 Quicksort 

Quicksort is, on most machines, on average, for large N, the fastest known 
sorting algorithm. It is a “partition-exchange” sorting method: A “partitioning 
element” a is selected from the array. Then by pairwise exchanges of elements, the 
original array is partitioned into two subarrays. At the end of a round of partitioning, 
the element a is in its final place in the array. All elements in the left subarray are 
≤ a, while all elements in the right subarray are ≥ a. The process is then repeated 
on the left and right subarrays independently, and so on. 

The partitioning process is carried out by selecting some element, say the 
leftmost, as the partitioning element a. Scan a pointer up the array until you find 
an element > a, and then scan another pointer down from the end of the array 
until you find an element < a. These two elements are clearly out of place for the 
final partitioned array, so exchange them. Continue this process until the pointers 
cross. This is the right place to insert a, and that round of partitioning is done. The 
question of the best strategy when an element is equal to the partitioning element 
is subtle; we refer you to Sedgewick [1] for a discussion. (Answer: You should 
stop and do an exchange.) 

For speed of execution, we do not implement Quicksort using recursion. Thus 
the algorithm requires an auxiliary array of storage, of length $2 \log_2 N$, which it uses 
as a push-down stack for keeping track of the pending subarrays. When a subarray 
has gotten down to some size M, it becomes faster to sort it by straight insertion 
(§8.1), so we will do this. The optimal setting of M is machine dependent, but 
M = 7 is not too far wrong. Some people advocate leaving the short subarrays 
unsorted until the end, and then doing one giant insertion sort at the end. Since 
each element moves at most 7 places, this is just as efficient as doing the sorts 
immediately, and saves on the overhead. However, on modern machines with paged 
memory, there is increased overhead when dealing with a large array all at once. We 
have not found any advantage in saving the insertion sorts till the end. 

As already mentioned, Quicksort’s average running time is fast, but its worst 
case running time can be very slow: For the worst case it is, in fact, an $N^2$ method! 
And for the most straightforward implementation of Quicksort it turns out that the 
worst case is achieved for an input array that is already in order! This ordering 
of the input array might easily occur in practice. One way to avoid this is to use 
a little random number generator to choose a random element as the partitioning 
element. Another is to use instead the median of the first, middle, and last elements 
of the current subarray. 
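
The median-of-three choice can be sketched in isolation. The helper below is hypothetical (its name and 0-based indexing are assumptions for illustration); it only reports which of the three candidate positions holds the median and does not rearrange anything.

```c
/* Returns the index (lo, the midpoint, or hi) whose element is the median
   of arr[lo], arr[mid], arr[hi]. 0-based, inclusive bounds. */
long median_of_three(const float arr[], long lo, long hi)
{
    long mid = lo + (hi - lo) / 2;
    float x = arr[lo], y = arr[mid], z = arr[hi];

    if ((x <= y && y <= z) || (z <= y && y <= x)) return mid;
    if ((y <= x && x <= z) || (z <= x && x <= y)) return lo;
    return hi;
}
```

On an array that is already in order this picks the middle element, which splits the subarray evenly rather than producing the degenerate split that the leftmost element would give.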

The great speed of Quicksort comes from the simplicity and efficiency of its 
inner loop. Simply adding one unnecessary test (for example, a test that your pointer 
has not moved off the end of the array) can almost double the running time! One 
avoids such unnecessary tests by placing “sentinels” at either end of the subarray 
being partitioned. The leftmost sentinel is ≤ a, the rightmost ≥ a. With the 
“median-of-three” selection of a partitioning element, we can use the two elements 
that were not the median to be the sentinels for that subarray. 

Our implementation closely follows [1]: 

#include "nrutil.h"
#define SWAP(a,b) temp=(a);(a)=(b);(b)=temp;
#define M 7
#define NSTACK 50
/* Here M is the size of subarrays sorted by straight insertion and NSTACK is
   the required auxiliary storage. */

void sort(unsigned long n, float arr[])
/* Sorts an array arr[1..n] into ascending numerical order using the Quicksort
   algorithm. n is input; arr is replaced on output by its sorted rearrangement. */
{
    unsigned long i,ir=n,j,k,l=1,*istack;
    int jstack=0;
    float a,temp;

    istack=lvector(1,NSTACK);
    for (;;) {
        if (ir-l < M) {                 /* Insertion sort when subarray small enough. */
            for (j=l+1;j<=ir;j++) {
                a=arr[j];
                for (i=j-1;i>=l;i--) {
                    if (arr[i] <= a) break;
                    arr[i+1]=arr[i];
                }
                arr[i+1]=a;
            }
            if (jstack == 0) break;
            ir=istack[jstack--];        /* Pop stack and begin a new round of */
            l=istack[jstack--];         /* partitioning. */
        } else {
            k=(l+ir) >> 1;              /* Choose median of left, center, and right */
            SWAP(arr[k],arr[l+1])       /* elements as partitioning element a. Also */
            if (arr[l] > arr[ir]) {     /* rearrange so that */
                SWAP(arr[l],arr[ir])    /* a[l] <= a[l+1] <= a[ir]. */
            }
            if (arr[l+1] > arr[ir]) {
                SWAP(arr[l+1],arr[ir])
            }
            if (arr[l] > arr[l+1]) {
                SWAP(arr[l],arr[l+1])
            }
            i=l+1;                      /* Initialize pointers for partitioning. */
            j=ir;
            a=arr[l+1];                 /* Partitioning element. */
            for (;;) {                  /* Beginning of innermost loop. */
                do i++; while (arr[i] < a);     /* Scan up to find element > a. */
                do j--; while (arr[j] > a);     /* Scan down to find element < a. */
                if (j < i) break;       /* Pointers crossed. Partitioning complete. */
                SWAP(arr[i],arr[j]);    /* Exchange elements. */
            }                           /* End of innermost loop. */
            arr[l+1]=arr[j];            /* Insert partitioning element. */
            arr[j]=a;
            jstack += 2;
            /* Push pointers to larger subarray on stack, process smaller
               subarray immediately. */
            if (jstack > NSTACK) nrerror("NSTACK too small in sort.");
            if (ir-i+1 >= j-l) {
                istack[jstack]=ir;
                istack[jstack-1]=i;
                ir=j-1;
            } else {
                istack[jstack]=j-1;
                istack[jstack-1]=l;
                l=i;
            }
        }
    }
    free_lvector(istack,1,NSTACK);
}


As usual you can move any other arrays around at the same time as you sort 
arr. At the risk of being repetitious: 

#include "nrutil.h"
#define SWAP(a,b) temp=(a);(a)=(b);(b)=temp;
#define M 7
#define NSTACK 50

void sort2(unsigned long n, float arr[], float brr[])
/* Sorts an array arr[1..n] into ascending order using Quicksort, while making
   the corresponding rearrangement of the array brr[1..n]. */
{
    unsigned long i,ir=n,j,k,l=1,*istack;
    int jstack=0;
    float a,b,temp;

    istack=lvector(1,NSTACK);
    for (;;) {
        if (ir-l < M) {                 /* Insertion sort when subarray small enough. */
            for (j=l+1;j<=ir;j++) {
                a=arr[j];
                b=brr[j];
                for (i=j-1;i>=l;i--) {
                    if (arr[i] <= a) break;
                    arr[i+1]=arr[i];
                    brr[i+1]=brr[i];
                }
                arr[i+1]=a;
                brr[i+1]=b;
            }
            if (!jstack) {
                free_lvector(istack,1,NSTACK);
                return;
            }
            ir=istack[jstack];          /* Pop stack and begin a new round of */
            l=istack[jstack-1];         /* partitioning. */
            jstack -= 2;
        } else {
            k=(l+ir) >> 1;              /* Choose median of left, center, and right */
            SWAP(arr[k],arr[l+1])       /* elements as partitioning element a. Also */
            SWAP(brr[k],brr[l+1])       /* rearrange so that a[l] <= a[l+1] <= a[ir]. */
            if (arr[l] > arr[ir]) {
                SWAP(arr[l],arr[ir])
                SWAP(brr[l],brr[ir])
            }
            if (arr[l+1] > arr[ir]) {
                SWAP(arr[l+1],arr[ir])
                SWAP(brr[l+1],brr[ir])
            }
            if (arr[l] > arr[l+1]) {
                SWAP(arr[l],arr[l+1])
                SWAP(brr[l],brr[l+1])
            }
            i=l+1;                      /* Initialize pointers for partitioning. */
            j=ir;
            a=arr[l+1];                 /* Partitioning element. */
            b=brr[l+1];
            for (;;) {                  /* Beginning of innermost loop. */
                do i++; while (arr[i] < a);     /* Scan up to find element > a. */
                do j--; while (arr[j] > a);     /* Scan down to find element < a. */
                if (j < i) break;       /* Pointers crossed. Partitioning complete. */
                SWAP(arr[i],arr[j])     /* Exchange elements of both arrays. */
                SWAP(brr[i],brr[j])
            }                           /* End of innermost loop. */
            arr[l+1]=arr[j];            /* Insert partitioning element in both arrays. */
            arr[j]=a;
            brr[l+1]=brr[j];
            brr[j]=b;
            jstack += 2;
            /* Push pointers to larger subarray on stack, process smaller
               subarray immediately. */
            if (jstack > NSTACK) nrerror("NSTACK too small in sort2.");
            if (ir-i+1 >= j-l) {
                istack[jstack]=ir;
                istack[jstack-1]=i;
                ir=j-1;
            } else {
                istack[jstack]=j-1;
                istack[jstack-1]=l;
                l=i;
            }
        }
    }
}



You could, in principle, rearrange any number of additional arrays along with 
brr, but this becomes wasteful as the number of such arrays becomes large. The 
preferred technique is to make use of an index table, as described in §8.4. 


CITED REFERENCES AND FURTHER READING: 

Sedgewick, R. 1978, Communications of the ACM, vol. 21, pp. 847-857. [1] 


8.3 Heapsort 

While usually not quite as fast as Quicksort, Heapsort is one of our favorite 
sorting routines. It is a true “in-place” sort, requiring no auxiliary storage. It is an 
$N \log_2 N$ process, not only on average, but also for the worst-case order of input data. 
In fact, its worst case is only 20 percent or so worse than its average running time. 

It is beyond our scope to give a complete exposition on the theory of Heapsort. 
We will mention the general principles, then let you refer to the references [1,2], or 
analyze the program yourself, if you want to understand the details. 

A set of N numbers $a_i$, $i = 1, \ldots, N$, is said to form a “heap” if it satisfies 
the relation 

$$a_{j/2} \ge a_j \quad \text{for} \quad 1 \le j/2 < j \le N \qquad (8.3.1)$$

Here the division in j/2 means “integer divide,” i.e., is an exact integer or else is 
rounded down to the closest integer. Definition (8.3.1) will make sense if you think 
of the numbers $a_i$ as being arranged in a binary tree, with the top, “boss,” node being 
$a_1$, the two “underling” nodes being $a_2$ and $a_3$, their four underling nodes being $a_4$ 
through $a_7$, etc. (See Figure 8.3.1.) In this form, a heap has every “supervisor” greater 
than or equal to its two “supervisees,” down through the levels of the hierarchy. 
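
Relation (8.3.1) translates directly into a check over the array. The function below is a small sketch for illustration only; the name is_heap is an assumption, and the array is taken to be 1-based (element 0 unused) to match the convention of the routines in this chapter.

```c
/* Returns 1 if a[1..n] satisfies a[j/2] >= a[j] for every j = 2..n,
   and 0 otherwise. a[0] is unused. */
int is_heap(const float a[], unsigned long n)
{
    unsigned long j;

    for (j = 2; j <= n; j++)
        if (a[j / 2] < a[j]) return 0;   /* A supervisee outranks its supervisor. */
    return 1;
}
```

Note that the test compares each element only with its supervisor, never with elements on the same level, which is why a heap is only partially ordered.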

If you have managed to rearrange your array into an order that forms a heap, 
then sorting it is very easy: You pull off the “top of the heap,” which will be the 
largest element yet unsorted. Then you “promote” to the top of the heap its largest 
underling. Then you promote its largest underling, and so on. The process is like 
what happens (or is supposed to happen) in a large corporation when the chairman 
of the board retires. You then repeat the whole process by retiring the new chairman 
of the board. Evidently the whole thing is an $N \log_2 N$ process, since each retiring 
chairman leads to $\log_2 N$ promotions of underlings. 

Well, how do you arrange the array into a heap in the first place? The answer 
is again a “sift-up” process like corporate promotion. Imagine that the corporation 
starts out with N/2 employees on the production line, but with no supervisors. Now 
a supervisor is hired to supervise two workers. If he is less capable than one of 
his workers, that one is promoted in his place, and he joins the production line. 
After supervisors are hired, then supervisors of supervisors are hired, and so on up 
the corporate ladder. Each employee is brought in at the top of the tree, but then 
immediately sifted down, with more capable workers promoted until their proper 
corporate level has been reached. 

Figure 8.3.1. Ordering implied by a “heap,” here of 12 elements. Elements connected by an upward 
path are sorted with respect to one another, but there is not necessarily any ordering among elements 
related only “laterally.” 

In the Heapsort implementation, the same “sift-up” code can be used for the 
initial creation of the heap and for the subsequent retirement-and-promotion phase. 
One execution of the Heapsort function represents the entire life-cycle of a giant 
corporation: N/2 workers are hired; N/2 potential supervisors are hired; there is a 
sifting up in the ranks, a sort of super Peter Principle: in due course, each of the 
original employees gets promoted to chairman of the board. 


void hpsort(unsigned long n, float ra[])
/* Sorts an array ra[1..n] into ascending numerical order using the Heapsort
   algorithm. n is input; ra is replaced on output by its sorted rearrangement. */
{
    unsigned long i,ir,j,l;
    float rra;

    if (n < 2) return;
    l=(n >> 1)+1;
    ir=n;
    /* The index l will be decremented from its initial value down to 1 during
       the "hiring" (heap creation) phase. Once it reaches 1, the index ir will
       be decremented from its initial value down to 1 during the
       "retirement-and-promotion" (heap selection) phase. */
    for (;;) {
        if (l > 1) {                /* Still in hiring phase. */
            rra=ra[--l];
        } else {                    /* In retirement-and-promotion phase. */
            rra=ra[ir];             /* Clear a space at end of array. */
            ra[ir]=ra[1];           /* Retire the top of the heap into it. */
            if (--ir == 1) {        /* Done with the last promotion. */
                ra[1]=rra;          /* The least competent worker of all! */
                break;
            }
        }
        i=l;                        /* Whether in the hiring phase or promotion */
        j=l+l;                      /* phase, we here set up to sift down element */
        while (j <= ir) {           /* rra to its proper level. */
            if (j < ir && ra[j] < ra[j+1]) j++;  /* Compare to the better underling. */
            if (rra < ra[j]) {      /* Demote rra. */
                ra[i]=ra[j];
                i=j;
                j <<= 1;
            } else break;           /* Found rra's level. Terminate the sift-down. */
        }
        ra[i]=rra;                  /* Put rra into its slot. */
    }
}



CITED REFERENCES AND FURTHER READING: 

Knuth, D.E. 1973, Sorting and Searching, vol. 3 of The Art of Computer Programming (Reading, 
MA: Addison-Wesley), §5.2.3. [1] 

Sedgewick, R. 1988, Algorithms, 2nd ed. (Reading, MA: Addison-Wesley), Chapter 11. [2] 


8.4 Indexing and Ranking 

The concept of keys plays a prominent role in the management of data files. A 
data record in such a file may contain several items, or fields. For example, a record 
in a file of weather observations may have fields recording time, temperature, and 
wind velocity. When we sort the records, we must decide which of these fields we 
want to be brought into sorted order. The other fields in a record just come along 
for the ride, and will not, in general, end up in any particular order. The field on 
which the sort is performed is called the key field. 

For a data file with many records and many fields, the actual movement of N 
records into the sorted order of their keys $K_i$, $i = 1, \ldots, N$, can be a daunting task. 
Instead, one can construct an index table $I_j$, $j = 1, \ldots, N$, such that the smallest 
$K_i$ has $i = I_1$, the second smallest has $i = I_2$, and so on up to the largest $K_i$ with 
$i = I_N$. In other words, the array 

$$K_{I_j} \qquad j = 1, 2, \ldots, N \qquad (8.4.1)$$

is in sorted order when indexed by j. When an index table is available, one need not 
move records from their original order. Further, different index tables can be made 
from the same set of records, indexing them to different keys. 

The algorithm for constructing an index table is straightforward: Initialize the 
index array with the integers from 1 to N, then perform the Quicksort algorithm, 
moving the elements around as if one were sorting the keys. The integer that initially 
numbered the smallest key thus ends up in the number one position, and so on. 

#include "nrutil.h"
#define SWAP(a,b) itemp=(a);(a)=(b);(b)=itemp;
#define M 7
#define NSTACK 50

void indexx(unsigned long n, float arr[], unsigned long indx[])
/* Indexes an array arr[1..n], i.e., outputs the array indx[1..n] such that
   arr[indx[j]] is in ascending order for j = 1, 2, ..., N. The input
   quantities n and arr are not changed. */
{
    unsigned long i,indxt,ir=n,itemp,j,k,l=1;
    int jstack=0,*istack;
    float a;

    istack=ivector(1,NSTACK);
    for (j=1;j<=n;j++) indx[j]=j;
    for (;;) {
        if (ir-l < M) {
            for (j=l+1;j<=ir;j++) {
                indxt=indx[j];
                a=arr[indxt];
                for (i=j-1;i>=l;i--) {
                    if (arr[indx[i]] <= a) break;
                    indx[i+1]=indx[i];
                }
                indx[i+1]=indxt;
            }
            if (jstack == 0) break;
            ir=istack[jstack--];
            l=istack[jstack--];
        } else {
            k=(l+ir) >> 1;
            SWAP(indx[k],indx[l+1]);
            if (arr[indx[l]] > arr[indx[ir]]) {
                SWAP(indx[l],indx[ir])
            }
            if (arr[indx[l+1]] > arr[indx[ir]]) {
                SWAP(indx[l+1],indx[ir])
            }
            if (arr[indx[l]] > arr[indx[l+1]]) {
                SWAP(indx[l],indx[l+1])
            }
            i=l+1;
            j=ir;
            indxt=indx[l+1];
            a=arr[indxt];
            for (;;) {
                do i++; while (arr[indx[i]] < a);
                do j--; while (arr[indx[j]] > a);
                if (j < i) break;
                SWAP(indx[i],indx[j])
            }
            indx[l+1]=indx[j];
            indx[j]=indxt;
            jstack += 2;
            if (jstack > NSTACK) nrerror("NSTACK too small in indexx.");
            if (ir-i+1 >= j-l) {
                istack[jstack]=ir;
                istack[jstack-1]=i;
                ir=j-1;
            } else {
                istack[jstack]=j-1;
                istack[jstack-1]=l;
                l=i;
            }
        }
    }
    free_ivector(istack,1,NSTACK);
}

Figure 8.4.1. (a) An unsorted array of six numbers. (b) Index table, whose entries are pointers to the 
elements of (a) in ascending order. (c) Rank table, whose entries are the ranks of the corresponding 
elements of (a). (d) Sorted array of the elements in (a). 



If you want to sort an array while making the corresponding rearrangement of 
several or many other arrays, you should first make an index table, then use it to 
rearrange each array in turn. This requires two arrays of working space: one to 
hold the index, and another into which an array is temporarily moved, and from 
which it is redeposited back on itself in the rearranged order. For 3 arrays, the 
procedure looks like this: 


#include "nrutil.h"

void sort3(unsigned long n, float ra[], float rb[], float rc[])
/* Sorts an array ra[1..n] into ascending numerical order while making the
   corresponding rearrangements of the arrays rb[1..n] and rc[1..n]. An index
   table is constructed via the routine indexx. */
{
    void indexx(unsigned long n, float arr[], unsigned long indx[]);
    unsigned long j,*iwksp;
    float *wksp;

    iwksp=lvector(1,n);
    wksp=vector(1,n);
    indexx(n,ra,iwksp);                         /* Make the index table. */
    for (j=1;j<=n;j++) wksp[j]=ra[j];           /* Save the array ra. */
    for (j=1;j<=n;j++) ra[j]=wksp[iwksp[j]];    /* Copy it back in rearranged order. */
    for (j=1;j<=n;j++) wksp[j]=rb[j];           /* Ditto rb. */
    for (j=1;j<=n;j++) rb[j]=wksp[iwksp[j]];
    for (j=1;j<=n;j++) wksp[j]=rc[j];           /* Ditto rc. */
    for (j=1;j<=n;j++) rc[j]=wksp[iwksp[j]];
    free_vector(wksp,1,n);
    free_lvector(iwksp,1,n);
}



The generalization to any other number of arrays is obviously straightforward. 

A rank table is different from an index table. A rank table’s jth entry gives the 
rank of the jth element of the original array of keys, ranging from 1 (if that element 
was the smallest) to N (if that element was the largest). One can easily construct 
a rank table from an index table, however: 

void rank(unsigned long n, unsigned long indx[], unsigned long irank[])
/* Given indx[1..n] as output from the routine indexx, returns an array
   irank[1..n], the corresponding table of ranks. */
{
    unsigned long j;

    for (j=1;j<=n;j++) irank[indx[j]]=j;
}


Figure 8.4.1 summarizes the concepts discussed in this section. 
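
The relationship between the two tables is easy to check on a small case. The sketch below uses six illustrative keys and, to stay self-contained, builds the index by straight insertion on the indices rather than by Quicksort; the names index_by_insertion and rank_from_index are assumptions made for this example, and arrays are 1-based with element 0 unused.

```c
/* Fill indx[1..n] so that key[indx[1]] <= key[indx[2]] <= ... <= key[indx[n]].
   Straight insertion on the indices: a stand-in suitable only for small n. */
void index_by_insertion(unsigned long n, const float key[], unsigned long indx[])
{
    unsigned long i, j, t;

    for (j = 1; j <= n; j++) indx[j] = j;
    for (j = 2; j <= n; j++) {
        t = indx[j];
        for (i = j - 1; i >= 1 && key[indx[i]] > key[t]; i--)
            indx[i + 1] = indx[i];
        indx[i + 1] = t;
    }
}

/* Invert the index table: irank[j] is the rank of key[j], from 1 to n. */
void rank_from_index(unsigned long n, const unsigned long indx[],
                     unsigned long irank[])
{
    unsigned long j;

    for (j = 1; j <= n; j++) irank[indx[j]] = j;
}
```

For the keys 14, 8, 32, 7, 3, 15 the index table comes out as 5, 4, 2, 1, 6, 3 and the rank table as 4, 3, 6, 2, 1, 5: the index table says where the jth smallest key lives, while the rank table says how the jth key places.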


8.5 Selecting the Mth Largest 

Selection is sorting’s austere sister. (Say that five times quickly!) Where sorting 
demands the rearrangement of an entire data array, selection politely asks for a single 
returned value: What is the kth smallest (or, equivalently, the (m = N + 1 − k)th largest) 
element out of N elements? The fastest methods for selection do, unfortunately, 
rearrange the array for their own computational purposes, typically putting all smaller 
elements to the left of the kth, all larger elements to the right, and scrambling the 
order within each subset. This side effect is at best innocuous, at worst downright 
inconvenient. When the array is very long, so that making a scratch copy of it is taxing 
on memory, or when the computational burden of the selection is a negligible part 
of a larger calculation, one turns to selection algorithms without side effects, which 
leave the original array undisturbed. Such in-place selection is slower than the faster 
selection methods by a factor of about 10. We give routines of both types, below. 

The most common use of selection is in the statistical characterization of a set 
of data. One often wants to know the median element in an array, or the top and 
bottom quartile elements. When N is odd, the median is the kth element, with 
k = (N + 1)/2. When N is even, statistics books define the median as the arithmetic 
mean of the elements k = N/2 and k = N/2 + 1 (that is, N/2 from the bottom 
and N/2 from the top). If you accept such pedantry, you must perform two separate 
selections to find these elements. For N > 100 we usually define k = N/2 to be 
the median element, pedants be damned. 
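
The two textbook cases can be written out as follows. This is only an illustration of the definition, not of fast selection: it sorts a scratch copy with the standard library's qsort, and the names median_textbook and cmp_float, the 0-based indexing, and the caller-supplied scratch buffer are all assumptions of the sketch.

```c
#include <stdlib.h>
#include <string.h>

static int cmp_float(const void *p, const void *q)
{
    float x = *(const float *)p, y = *(const float *)q;
    return (x > y) - (x < y);
}

/* Median of data[0..n-1], n >= 1. scratch must hold n floats; data is
   left untouched. */
float median_textbook(const float data[], size_t n, float scratch[])
{
    memcpy(scratch, data, n * sizeof(float));
    qsort(scratch, n, sizeof(float), cmp_float);
    if (n % 2 == 1)
        return scratch[(n - 1) / 2];                    /* k = (N+1)/2, 1-based */
    return 0.5f * (scratch[n / 2 - 1] + scratch[n / 2]); /* k = N/2 and N/2 + 1 */
}
```

The even case needs two order statistics, which is the source of the "two separate selections" mentioned above when a selection routine is used in place of a full sort.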

The fastest general method for selection, allowing rearrangement, is partitioning, 
exactly as was done in the Quicksort algorithm (§8.2). Selecting a “random” 
partition element, one marches through the array, forcing smaller elements to the 
left, larger elements to the right. As in Quicksort, it is important to optimize the 
inner loop, using “sentinels” (§8.2) to minimize the number of comparisons. For 
sorting, one would then proceed to further partition both subsets. For selection, 
we can ignore one subset and attend only to the one that contains our desired kth 
element. Selection by partitioning thus does not need a stack of pending operations, 
and its operations count scales as N rather than as N log N (see [1]). Comparison 
with sort in §8.2 should make the following routine obvious: 

#define SWAP(a,b) temp=(a);(a)=(b);(b)=temp;

float select(unsigned long k, unsigned long n, float arr[])
/* Returns the kth smallest value in the array arr[1..n]. The input array will
   be rearranged to have this value in location arr[k], with all smaller
   elements moved to arr[1..k-1] (in arbitrary order) and all larger elements
   in arr[k+1..n] (also in arbitrary order). */
{
    unsigned long i,ir,j,l,mid;
    float a,temp;

    l=1;
    ir=n;
    for (;;) {
        if (ir <= l+1) {                /* Active partition contains 1 or 2 elements. */
            if (ir == l+1 && arr[ir] < arr[l]) {    /* Case of 2 elements. */
                SWAP(arr[l],arr[ir])
            }
            return arr[k];
        } else {
            mid=(l+ir) >> 1;            /* Choose median of left, center, and right */
            SWAP(arr[mid],arr[l+1])     /* elements as partitioning element a. Also */
            if (arr[l] > arr[ir]) {     /* rearrange so that arr[l] <= arr[l+1], */
                SWAP(arr[l],arr[ir])    /* arr[ir] >= arr[l+1]. */
            }
            if (arr[l+1] > arr[ir]) {
                SWAP(arr[l+1],arr[ir])
            }
            if (arr[l] > arr[l+1]) {
                SWAP(arr[l],arr[l+1])
            }
            i=l+1;                      /* Initialize pointers for partitioning. */
            j=ir;
            a=arr[l+1];                 /* Partitioning element. */
            for (;;) {                  /* Beginning of innermost loop. */
                do i++; while (arr[i] < a);     /* Scan up to find element > a. */
                do j--; while (arr[j] > a);     /* Scan down to find element < a. */
                if (j < i) break;       /* Pointers crossed. Partitioning complete. */
                SWAP(arr[i],arr[j])
            }                           /* End of innermost loop. */
            arr[l+1]=arr[j];            /* Insert partitioning element. */
            arr[j]=a;
            if (j >= k) ir=j-1;         /* Keep active the partition that contains */
            if (j <= k) l=i;            /* the kth element. */
        }
    }
}



In-place, nondestructive, selection is conceptually simple, but it requires a lot 
of bookkeeping, and it is correspondingly slower. The general idea is to pick some 
number M of elements at random, to sort them, and then to make a pass through 
the array counting how many elements fall in each of the M + 1 intervals defined 
by these elements. The kth largest will fall in one such interval — call it the “live” 
interval. One then does a second round, first picking M random elements in the live 
interval, and then determining which of the new, finer, M + 1 intervals all presently 
live elements fall into. And so on, until the kth element is finally localized within a 
single array of size M, at which point direct selection is possible. 

How shall we pick M? The number of rounds, $\log_M N = \log_2 N / \log_2 M$, 
will be smaller if M is larger; but the work to locate each element among M + 1 
subintervals will be larger, scaling as $\log_2 M$ for bisection, say. Each round 
requires looking at all N elements, if only to find those that are still alive, while 
the bisections are dominated by the N that occur in the first round. Minimizing 
$O(N \log_M N) + O(N \log_2 M)$ thus yields the result 

$$M \sim 2^{\sqrt{\log_2 N}} \qquad (8.5.1)$$

The square root of the logarithm is so slowly varying that secondary considerations of 
machine timing become important. We use M = 64 as a convenient constant value. 
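
The minimization behind (8.5.1) takes only a line or two. The sketch below assumes, purely for illustration, that the two cost terms carry comparable constant factors, and introduces the shorthand $x = \log_2 M$ and $L = \log_2 N$, neither of which appears elsewhere in this section.

```latex
% Total work per element, up to constant factors, with x = log_2 M, L = log_2 N:
%   rounds:     log_M N = L / x
%   bisection:  log_2 M = x
\[
  C(x) \;\propto\; N\left(\frac{L}{x} + x\right),
  \qquad
  \frac{dC}{dx} \;\propto\; N\left(1 - \frac{L}{x^{2}}\right) = 0
  \;\Longrightarrow\;
  x = \sqrt{L}
  \;\Longrightarrow\;
  M = 2^{\sqrt{\log_2 N}} .
\]
% Rough magnitudes: N = 10^5 gives log_2 N ~ 16.6 and M ~ 17;
%                   N = 10^9 gives log_2 N ~ 29.9 and M ~ 44.
```

Because the optimum moves only from about 17 to about 44 as N grows by four orders of magnitude, a fixed value of the same general size costs little.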

Two minor additional tricks in the following routine, selip, are (i) augmenting 
the set of M random values by an M + 1st, the arithmetic mean, and (ii) choosing 
the M random values “on the fly” in a pass through the data, by a method that makes 
later values no less likely to be chosen than earlier ones. (The underlying idea is to 
give element m > M an M/m chance of being brought into the set. You can prove 
by induction that this yields the desired result.) 
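
The idea in (ii) is what is usually called reservoir sampling. The sketch below shows its classical form, in which a genuinely random slot is overwritten; it is not the shortcut used in selip, and the name reservoir_sample, the 0-based indexing, and the use of rand() (whose modulo reduction is slightly biased) are assumptions of this example.

```c
#include <stdlib.h>

/* Keep m of the n values in data[0..n-1], chosen in a single pass so that
   later values are as likely to survive as earlier ones. The kept values
   are written to keep[0..m-1]. Returns the number stored, min(m, n). */
size_t reservoir_sample(const float data[], size_t n, float keep[], size_t m)
{
    size_t seen, slot;

    for (seen = 0; seen < n; seen++) {
        if (seen < m) {
            keep[seen] = data[seen];    /* The first m values always enter. */
        } else {
            /* Value number seen+1 enters with probability m/(seen+1),
               replacing a uniformly chosen current member. */
            slot = (size_t)rand() % (seen + 1);
            if (slot < m) keep[slot] = data[seen];
        }
    }
    return n < m ? n : m;
}
```

No advance knowledge of n is needed, which is what makes the method usable "on the fly" while the array is being scanned for in-range elements.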

#include "nrutil.h"
#define M 64
#define BIG 1.0e30
#define FREEALL free_vector(sel,1,M+2);free_lvector(isel,1,M+2);

float selip(unsigned long k, unsigned long n, float arr[])
/* Returns the kth smallest value in the array arr[1..n]. The input array is
   not altered. */
{
    void shell(unsigned long n, float a[]);
    unsigned long i,j,jl,jm,ju,kk,mm,nlo,nxtmm,*isel;
    float ahi,alo,sum,*sel;

    if (k < 1 || k > n || n <= 0) nrerror("bad input to selip");
    isel=lvector(1,M+2);
    sel=vector(1,M+2);
    kk=k;
    ahi=BIG;
    alo = -BIG;
    for (;;) {                      /* Main iteration loop, until desired */
        mm=nlo=0;                   /* element is isolated. */
        sum=0.0;
        nxtmm=M+1;
        for (i=1;i<=n;i++) {        /* Make a pass through the whole array. */
            if (arr[i] >= alo && arr[i] <= ahi) {
                /* Consider only elements in the current brackets. */
                mm++;
                if (arr[i] == alo) nlo++;   /* In case of ties for low bracket. */
                /* Now use statistical procedure for selecting m in-range
                   elements with equal probability, even without knowing in
                   advance how many there are! */
                if (mm <= M) sel[mm]=arr[i];
                else if (mm == nxtmm) {
                    nxtmm=mm+mm/M;
                    sel[1 + ((i+mm+kk) % M)]=arr[i];    /* The % operation provides */
                }                                       /* a somewhat random number. */
                sum += arr[i];
            }
        }
        if (kk <= nlo) {            /* Desired element is tied for lower bound; */
            FREEALL                 /* return it. */
            return alo;
        }
        else if (mm <= M) {         /* All in-range elements were kept. So */
            shell(mm,sel);          /* return answer by direct method. */
            ahi = sel[kk];
            FREEALL
            return ahi;
        }
        sel[M+1]=sum/mm;            /* Augment selected set by mean value (fixes */
        shell(M+1,sel);             /* degeneracies), and sort it. */
        sel[M+2]=ahi;
        for (j=1;j<=M+2;j++) isel[j]=0;     /* Zero the count array. */
        for (i=1;i<=n;i++) {        /* Make another pass through the array. */
            if (arr[i] >= alo && arr[i] <= ahi) {   /* For each in-range element.. */
                jl=0;
                ju=M+2;
                while (ju-jl > 1) { /* ...find its position among the select by */
                    jm=(ju+jl)/2;   /* bisection... */
                    if (arr[i] >= sel[jm]) jl=jm;
                    else ju=jm;
                }
                isel[ju]++;         /* ...and increment the counter. */
            }
        }
        j=1;                        /* Now we can narrow the bounds to just */
        while (kk > isel[j]) {      /* one bin, that is, by a factor of order m. */
            alo=sel[j];
            kk -= isel[j++];
        }
        ahi=sel[j];
    }
}



Approximate timings: selip is about 10 times slower than select. Indeed, 
for N in the range of $\sim 10^5$, selip is about 1.5 times slower than a full sort with 
sort, while select is about 6 times faster than sort. You should weigh time 
against memory and convenience carefully. 

Of course neither of the above routines should be used for the trivial cases of 
finding the largest, or smallest, element in an array. Those cases, you code by hand 
as simple for loops. There are also good ways to code the case where k is modest in 
comparison to N, so that extra memory of order k is not burdensome. An example 
is to use the method of Heapsort (§8.3) to make a single pass through an array of 
length N while saving the m largest elements. The advantage of the heap structure 
is that only log m, rather than m, comparisons are required every time a new element 
is added to the candidate list. This becomes a real savings when $m > O(\sqrt{N})$, but 
it never hurts otherwise and is easy to code. The following program gives the idea. 

void hpsel(unsigned long m, unsigned long n, float arr[], float heap[])
/* Returns in heap[1..m] the largest m elements of the array arr[1..n], with
   heap[1] guaranteed to be the mth largest element. The array arr is not
   altered. For efficiency, this routine should be used only when m << n. */
{
    void sort(unsigned long n, float arr[]);
    void nrerror(char error_text[]);
    unsigned long i,j,k;
    float swap;

    if (m > n/2 || m < 1) nrerror("probable misuse of hpsel");
    for (i=1;i<=m;i++) heap[i]=arr[i];
    sort(m,heap);               /* Create initial heap by overkill! We assume m << n. */
    for (i=m+1;i<=n;i++) {      /* For each remaining element... */
        if (arr[i] > heap[1]) { /* Put it on the heap? */
            heap[1]=arr[i];
            for (j=1;;) {       /* Sift down. */
                k=j << 1;
                if (k > m) break;
                if (k != m && heap[k] > heap[k+1]) k++;
                if (heap[j] <= heap[k]) break;
                swap=heap[k];
                heap[k]=heap[j];
                heap[j]=swap;
                j=k;
            }
        }
    }
}



CITED REFERENCES AND FURTHER READING: 

Sedgewick, R. 1988, Algorithms, 2nd ed. (Reading, MA: Addison-Wesley), pp. 126ff. [1] 

Knuth, D.E. 1973, Sorting and Searching, vol. 3 of The Art of Computer Programming (Reading, 
MA: Addison-Wesley). 


8.6 Determination of Equivalence Classes 


A number of techniques for sorting and searching relate to data structures whose details 
are beyond the scope of this book, for example, trees, linked lists, etc. These structures and 
their manipulations are the bread and butter of computer science, as distinct from numerical 
analysis, and there is no shortage of books on the subject. 

In working with experimental data, we have found that one particular such manipulation, 
namely the determination of equivalence classes, arises sufficiently often to justify inclusion 
here. 

The problem is this: There are N “elements” (or “data points” or whatever), numbered 
1, ..., N. You are given pairwise information about whether elements are in the same 
equivalence class of “sameness,” by whatever criterion happens to be of interest. For example, 
you may have a list of facts like: “Element 3 and element 7 are in the same class; element 
19 and element 4 are in the same class; element 7 and element 12 are in the same class.” 
Alternatively, you may have a procedure, given the numbers of two elements j and k, for 
deciding whether they are in the same class or different classes. (Recall that an equivalence 
relation can be anything satisfying the RST properties: reflexive, symmetric, transitive. This 
is compatible with any intuitive definition of “sameness.”) 

The desired output is an assignment to each of the N elements of an equivalence class 
number, such that two elements are in the same class if and only if they are assigned the 
same class number. 

Efficient algorithms work like this: Let F(j) be the class or “family” number of element 
j. Start off with each element in its own family, so that F(j) = j. The array F(j) can be 
interpreted as a tree structure, where F(j) denotes the parent of j. If we arrange for each family 
to be its own tree, disjoint from all the other “family trees,” then we can label each family 
(equivalence class) by its most senior great-great-...grandparent. The detailed topology of 
the tree doesn’t matter at all, as long as we graft each related element onto it somewhere. 

Therefore, we process each elemental datum “j is equivalent to k” by (i) tracking j 
up to its highest ancestor, (ii) tracking k up to its highest ancestor, (iii) giving j to k as a 
new parent, or vice versa (it makes no difference). After processing all the relations, we go 
through all the elements j and reset their F(j)’s to their highest possible ancestors, which 
then label the equivalence classes. 

The following routine, based on Knuth [1], assumes that there are m elemental pieces 
of information, stored in two arrays of length m, lista, listb, the interpretation being 
that lista[j] and listb[j], j=1...m, are the numbers of two elements which (we are 
thus told) are related. 

void eclass(int nf[], int n, int lista[], int listb[], int m)
/* Given m equivalences between pairs of n individual elements in the form of
   the input arrays lista[1..m] and listb[1..m], this routine returns in
   nf[1..n] the number of the equivalence class of each of the n elements,
   integers between 1 and n (not all such integers used). */
{
    int l,k,j;

    for (k=1;k<=n;k++) nf[k]=k;         /* Initialize each element its own class. */
    for (l=1;l<=m;l++) {                /* For each piece of input information... */
        j=lista[l];
        while (nf[j] != j) j=nf[j];     /* Track first element up to its ancestor. */
        k=listb[l];
        while (nf[k] != k) k=nf[k];     /* Track second element up to its ancestor. */
        if (j != k) nf[j]=k;            /* If they are not already related, make them so. */
    }
    for (j=1;j<=n;j++)                  /* Final sweep up to highest ancestors. */
        while (nf[j] != nf[nf[j]]) nf[j]=nf[nf[j]];
}

Alternatively, we may be able to construct a function equiv(j,k) that returns a nonzero 
(true) value if elements j and k are related, or a zero (false) value if they are not. Then we 
want to loop over all pairs of elements to get the complete picture. D. Eardley has devised 
a clever way of doing this while simultaneously sweeping the tree up to high ancestors in a 
manner that keeps it current and obviates most of the final sweep phase: 

void eclazz(int nf[], int n, int (*equiv)(int, int))
/* Given a user-supplied boolean function equiv which tells whether a pair of
   elements, each in the range 1...n, are related, return in nf[1..n]
   equivalence class numbers for each element. */
{
    int kk,jj;

    nf[1]=1;
    for (jj=2;jj<=n;jj++) {             /* Loop over first element of all pairs. */
        nf[jj]=jj;
        for (kk=1;kk<=(jj-1);kk++) {    /* Loop over second element of all pairs. */
            nf[kk]=nf[nf[kk]];          /* Sweep it up this much. */
            if ((*equiv)(jj,kk)) nf[nf[nf[kk]]]=jj;
            /* Good exercise for the reader to figure out why this much
               ancestry is necessary! */
        }
    }
    for (jj=1;jj<=n;jj++) nf[jj]=nf[nf[jj]];    /* Only this much sweeping is */
}                                               /* needed finally. */
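
As a small worked case, the three facts quoted earlier in this section (3 with 7, 19 with 4, 7 with 12) can be run through the same three steps: track each element to its highest ancestor, graft one ancestor onto the other, then sweep. The function classify_pairs below is a hypothetical, self-contained restatement written for this example, with 1-based arrays and element 0 unused.

```c
/* nf[1..n] receives a class label for each element, given m related pairs
   (a[l], b[l]), l = 1..m. Element 0 of every array is unused. */
void classify_pairs(int nf[], int n, const int a[], const int b[], int m)
{
    int j, k, l;

    for (k = 1; k <= n; k++) nf[k] = k;      /* Every element starts alone. */
    for (l = 1; l <= m; l++) {
        j = a[l];
        while (nf[j] != j) j = nf[j];        /* Highest ancestor of a[l]. */
        k = b[l];
        while (nf[k] != k) k = nf[k];        /* Highest ancestor of b[l]. */
        if (j != k) nf[j] = k;               /* Graft one family onto the other. */
    }
    for (j = 1; j <= n; j++)                 /* Relabel by highest ancestor. */
        while (nf[j] != nf[nf[j]]) nf[j] = nf[nf[j]];
}
```

With n = 20 and those three pairs, elements 3, 7, and 12 all end up labeled 12, elements 19 and 4 are both labeled 4, and every other element keeps its own number as a class of one.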


CITED REFERENCES AND FURTHER READING: 

Knuth, D.E. 1968, Fundamental Algorithms, vol. 1 of The Art of Computer Programming (Reading, 
MA: Addison-Wesley), §2.3.3. [1] 

Sedgewick, R. 1988, Algorithms, 2nd ed. (Reading, MA: Addison-Wesley), Chapter 30. 


Chapter 9. Root Finding and 
Nonlinear Sets of Equations 

9.0 Introduction 

We now consider that most basic of tasks, solving equations numerically. While 
most equations are born with both a right-hand side and a left-hand side, one 
traditionally moves all terms to the left, leaving 

f(x) = 0 (9.0.1) 

whose solution or solutions are desired. When there is only one independent variable, 
the problem is one-dimensional, namely to find the root or roots of a function. 

With more than one independent variable, more than one equation can be 
satisfied simultaneously. You likely once learned the implicit function theorem 
which (in this context) gives us the hope of satisfying N equations in N unknowns 
simultaneously. Note that we have only hope, not certainty. A nonlinear set of 
equations may have no (real) solutions at all. Contrariwise, it may have more than 
one solution. The implicit function theorem tells us that “generically” the solutions 
will be distinct, pointlike, and separated from each other. If, however, life is so 
unkind as to present you with a nongeneric, i.e., degenerate, case, then you can get 
a continuous family of solutions. In vector notation, we want to find one or more 
N-dimensional solution vectors x such that 

f(x) = 0 (9.0.2) 

where f is the N-dimensional vector-valued function whose components are the 
individual equations to be satisfied simultaneously. 

Don’t be fooled by the apparent notational similarity of equations (9.0.2) and 
(9.0.1). Simultaneous solution of equations in N dimensions is much more difficult 
than finding roots in the one-dimensional case. The principal difference between one 
and many dimensions is that, in one dimension, it is possible to bracket or “trap” a root 
between bracketing values, and then hunt it down like a rabbit. In multidimensions, 
you can never be sure that the root is there at all until you have found it. 

Except in linear problems, root finding invariably proceeds by iteration, and 
this is equally true in one or in many dimensions. Starting from some approximate 
trial solution, a useful algorithm will improve the solution until some predetermined 
convergence criterion is satisfied. For smoothly varying functions, good algorithms 
will always converge, provided that the initial guess is good enough. Indeed one can 
even determine in advance the rate of convergence of most algorithms. 

It cannot be overemphasized, however, how crucially success depends on having 
a good first guess for the solution, especially for multidimensional problems. This 
crucial beginning usually depends on analysis rather than numerics. Carefully crafted 
initial estimates reward you not only with reduced computational effort, but also 
with understanding and increased self-esteem. Hamming’s motto, “the purpose of 
computing is insight, not numbers,” is particularly apt in the area of finding roots. 
You should repeat this motto aloud whenever your program converges, with ten-digit 
accuracy, to the wrong root of a problem, or whenever it fails to converge because 
there is actually no root, or because there is a root but your initial estimate was 
not sufficiently close to it. 

“This talk of insight is all very well, but what do I actually do?” For one-dimensional 
root finding, it is possible to give some straightforward answers: You 
should try to get some idea of what your function looks like before trying to find 
its roots. If you need to mass-produce roots for many different functions, then you 
should at least know what some typical members of the ensemble look like. Next, 
you should always bracket a root, that is, know that the function changes sign in an 
identified interval, before trying to converge to the root’s value. 

Finally (this is advice with which some daring souls might disagree, but we 
give it nonetheless) never let your iteration method get outside of the best bracketing 
bounds obtained at any stage. We will see below that some pedagogically important 
algorithms, such as the secant method or Newton-Raphson, can violate this last constraint, 
and are thus not recommended unless certain fixups are implemented. 

Multiple roots, or very close roots, are a real problem, especially if the 
multiplicity is an even number. In that case, there may be no readily apparent 
sign change in the function, so the notion of bracketing a root — and maintaining 
the bracket — becomes difficult. We are hard-liners: we nevertheless insist on 
bracketing a root, even if it takes the minimum-searching techniques of Chapter 10 
to determine whether a tantalizing dip in the function really does cross zero or not. 
(You can easily modify the simple golden section routine of §10.1 to return early 
if it detects a sign change in the function. And, if the minimum of the function is 
exactly zero, then you have found a double root.) 

As usual, we want to discourage you from using routines as black boxes without 
understanding them. However, as a guide to beginners, here are some reasonable 
starting points: 

• Brent’s algorithm in §9.3 is the method of choice to find a bracketed root 
of a general one-dimensional function, when you cannot easily compute 
the function’s derivative. Ridders’ method (§9.2) is concise, and a close 
competitor. 

• When you can compute the function’s derivative, the routine rtsafe in 
§9.4, which combines the Newton-Raphson method with some bookkeeping 
on bounds, is recommended. Again, you must first bracket your root. 

• Roots of polynomials are a special case. Laguerre’s method, in §9.5, 
is recommended as a starting point. Beware: Some polynomials are 
ill-conditioned! 

• Finally, for multidimensional problems, the only elementary method is 
Newton-Raphson (§9.6), which works very well if you can supply a 
good first guess of the solution. Try it. Then read the more advanced 
material in §9.7 for some more complicated, but globally more convergent, 
alternatives. 

Avoiding implementations for specific computers, this book must generally 
steer clear of interactive or graphics-related routines. We make an exception right 
now. The following routine, which produces a crude function plot with interactively 
scaled axes, can save you a lot of grief as you enter the world of root finding. 

#include <stdio.h>

#define ISCR 60        /* Number of horizontal and vertical positions in display. */
#define JSCR 21
#define BLANK ' '
#define ZERO '-'
#define YY 'l'
#define XX '-'
#define FF 'x'

void scrsho(float (*fx)(float))
/* For interactive CRT terminal use. Produce a crude graph of the function fx over the prompted-
for interval x1,x2. Query for another plot until the user signals satisfaction. */
{
    int jz,j,i;
    float ysml,ybig,x2,x1,x,dyj,dx,y[ISCR+1];
    char scr[ISCR+1][JSCR+1];

    for (;;) {
        printf("\nEnter x1 x2 (x1=x2 to stop):\n");    /* Query for another plot, quit if x1=x2. */
        scanf("%f %f",&x1,&x2);
        if (x1 == x2) break;
        for (j=1;j<=JSCR;j++)                  /* Fill vertical sides with character 'l'. */
            scr[1][j]=scr[ISCR][j]=YY;
        for (i=2;i<=(ISCR-1);i++) {
            scr[i][1]=scr[i][JSCR]=XX;         /* Fill top, bottom with character '-'. */
            for (j=2;j<=(JSCR-1);j++)          /* Fill interior with blanks. */
                scr[i][j]=BLANK;
        }
        dx=(x2-x1)/(ISCR-1);
        x=x1;
        ysml=ybig=0.0;                         /* Limits will include 0. */
        for (i=1;i<=ISCR;i++) {                /* Evaluate the function at equal intervals. */
            y[i]=(*fx)(x);                     /* Find the largest and smallest values. */
            if (y[i] < ysml) ysml=y[i];
            if (y[i] > ybig) ybig=y[i];
            x += dx;
        }
        if (ybig == ysml) ybig=ysml+1.0;       /* Be sure to separate top and bottom. */
        dyj=(JSCR-1)/(ybig-ysml);
        jz=1-(int) (ysml*dyj);                 /* Note which row corresponds to 0. */
        for (i=1;i<=ISCR;i++) {                /* Place an indicator at function height and 0. */
            scr[i][jz]=ZERO;
            j=1+(int) ((y[i]-ysml)*dyj);
            scr[i][j]=FF;
        }
        printf(" %10.3f ",ybig);
        for (i=1;i<=ISCR;i++) printf("%c",scr[i][JSCR]);
        printf("\n");
        for (j=(JSCR-1);j>=2;j--) {            /* Display. */
            printf("%12s"," ");
            for (i=1;i<=ISCR;i++) printf("%c",scr[i][j]);
            printf("\n");
        }
        printf(" %10.3f ",ysml);
        for (i=1;i<=ISCR;i++) printf("%c",scr[i][1]);
        printf("\n");
        printf("%8s %10.3f %44s %10.3f\n"," ",x1," ",x2);
    }
}


CITED REFERENCES AND FURTHER READING: 

Stoer, J., and Bulirsch, R. 1980, Introduction to Numerical Analysis (New York: Springer-Verlag), 
Chapter 5. 

Acton, F.S. 1970, Numerical Methods That Work; 1990, corrected edition (Washington: Mathematical 
Association of America), Chapters 2, 7, and 14. 

Ralston, A., and Rabinowitz, P. 1978, A First Course in Numerical Analysis, 2nd ed. (New York: 
McGraw-Hill), Chapter 8. 

Householder, A.S. 1970, The Numerical Treatment of a Single Nonlinear Equation (New York: 
McGraw-Hill). 


9.1 Bracketing and Bisection 

We will say that a root is bracketed in the interval (a, b) if f(a) and f(b) have 
opposite signs. If the function is continuous, then at least one root must lie in 
that interval (the intermediate value theorem). If the function is discontinuous, but 
bounded, then instead of a root there might be a step discontinuity which crosses 
zero (see Figure 9.1.1). For numerical purposes, that might as well be a root, since 
the behavior is indistinguishable from the case of a continuous function whose zero 
crossing occurs in between two “adjacent” floating-point numbers in a machine’s 
finite-precision representation. Only for functions with singularities is there the 
possibility that a bracketed root is not really there, as for example 

$$ f(x) = \frac{1}{x - c} \tag{9.1.1} $$

Some root-finding algorithms (e.g., bisection in this section) will readily converge 
to c in (9.1.1). Luckily there is not much possibility of your mistaking c, or any 
number x close to it, for a root, since mere evaluation of |f(x)| will give a very 
large, rather than a very small, result. 
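
A minimal sketch of that sanity check follows. The names accept_root, pole, and line, and the threshold ftol, are assumptions made for this illustration only.

```c
#include <assert.h>
#include <math.h>

/* Accept x as a root only if |f(x)| is at most ftol (ftol is a caller-chosen,
   problem-dependent threshold). */
int accept_root(double (*f)(double), double x, double ftol)
{
    assert(ftol >= 0.0);
    return fabs(f(x)) <= ftol;
}

/* Illustrative functions: a pole at x = 2 with no root, and a true root at x = 2. */
double pole(double x) { return 1.0 / (x - 2.0); }
double line(double x) { return x - 2.0; }
```

A point very close to 2 is rejected for pole, where |f| is enormous, and accepted for line, where |f| is tiny.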

If you are given a function in a black box, there is no sure way of bracketing 
its roots, or of even determining that it has roots. If you like pathological examples, 
think about the problem of locating the two real roots of equation (3.0.1), which dips 
below zero only in the ridiculously small interval of about $x = \pi \pm 10^{-667}$. 

In the next chapter we will deal with the related problem of bracketing a 
function’s minimum. There it is possible to give a procedure that always succeeds; 
in essence, “Go downhill, taking steps of increasing size, until your function starts 
back uphill.” There is no analogous procedure for roots. The procedure “go downhill 
until your function changes sign,” can be foiled by a function that has a simple 
extremum. Nevertheless, if you are prepared to deal with a “failure” outcome, this 
procedure is often a good first start; success is usual if your function has opposite 
signs in the limit $x \to \pm\infty$. 

Figure 9.1.1. Some situations encountered while root finding: (a) shows an isolated root $x_1$ bracketed 
by two points a and b at which the function has opposite signs; (b) illustrates that there is not necessarily 
a sign change in the function near a double root (in fact, there is not necessarily a root!); (c) is a 
pathological function with many roots; in (d) the function has opposite signs at points a and b, but the 
points bracket a singularity, not a root. 

#include <math.h>

#define FACTOR 1.6
#define NTRY 50

int zbrac(float (*func)(float), float *x1, float *x2)
/* Given a function func and an initial guessed range x1 to x2, the routine expands the range
geometrically until a root is bracketed by the returned values x1 and x2 (in which case zbrac
returns 1) or until the range becomes unacceptably large (in which case zbrac returns 0). */
{
    void nrerror(char error_text[]);
    int j;
    float f1,f2;

    if (*x1 == *x2) nrerror("Bad initial range in zbrac");
    f1=(*func)(*x1);
    f2=(*func)(*x2);
    for (j=1;j<=NTRY;j++) {
        if (f1*f2 < 0.0) return 1;
        if (fabs(f1) < fabs(f2))
            f1=(*func)(*x1 += FACTOR*(*x1-*x2));
        else
            f2=(*func)(*x2 += FACTOR*(*x2-*x1));
    }
    return 0;
}
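
To see the outward search in action without nrerror, here is a double-precision sketch. The name expand_bracket, the ntry parameter, and the two test functions are assumptions for this illustration; the expansion factor 1.6 is the one used above.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical double-precision variant of the outward search: grow [*a,*b]
   geometrically until f changes sign across it. Returns 1 on success, 0 if
   no sign change is seen within ntry expansions. */
int expand_bracket(double (*f)(double), double *a, double *b, int ntry)
{
    const double factor = 1.6;
    double fa, fb;
    int j;

    assert(*a != *b);
    fa = f(*a);
    fb = f(*b);
    for (j = 0; j < ntry; j++) {
        if (fa * fb < 0.0) return 1;
        if (fabs(fa) < fabs(fb)) {       /* move the end with the smaller |f| */
            *a += factor * (*a - *b);
            fa = f(*a);
        } else {
            *b += factor * (*b - *a);
            fb = f(*b);
        }
    }
    return 0;
}

/* Example functions (illustrative only). */
double shifted_parabola(double x) { return x * x - 10.0; }   /* real roots exist */
double always_positive(double x)  { return x * x + 1.0; }    /* no real root */
```

Starting from the range 0 to 1, the search succeeds for shifted_parabola after two expansions and reports failure for always_positive, which is exactly the "failure" outcome a caller must be prepared to handle.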


Alternatively, you might want to “look inward” on an initial interval, rather 
than “look outward” from it, asking if there are any roots of the function f(x) in 
the interval from $x_1$ to $x_2$ when a search is carried out by subdivision into n equal 
intervals. The following function calculates brackets for up to nb distinct intervals 
which each contain one or more roots. 


void zbrak(float (*fx)(float), float x1, float x2, int n, float xb1[],
    float xb2[], int *nb)
/* Given a function fx defined on the interval from x1-x2 subdivide the interval into n equally
spaced segments, and search for zero crossings of the function. nb is input as the maximum
number of roots sought, and is reset to the number of bracketing pairs xb1[1..nb], xb2[1..nb]
that are found. */
{
    int nbb,i;
    float x,fp,fc,dx;

    nbb=0;
    dx=(x2-x1)/n;                  /* Determine the spacing appropriate to the mesh. */
    fp=(*fx)(x=x1);
    for (i=1;i<=n;i++) {           /* Loop over all intervals. */
        fc=(*fx)(x += dx);
        if (fc*fp <= 0.0) {        /* If a sign change occurs then record values for the bounds. */
            xb1[++nbb]=x-dx;
            xb2[nbb]=x;
            if(*nb == nbb) return;
        }
        fp=fc;
    }
    *nb = nbb;
}
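
For comparison, a zero-based sketch of the same inward search is given below. The name scan_brackets, the return-value convention, and the example cubic are assumptions made for this illustration.

```c
#include <assert.h>

/* Hypothetical zero-based, double-precision variant of the inward search.
   Splits [x1,x2] into n equal segments and stores up to maxb bracketing
   pairs in lo[] and hi[]; returns the number of pairs found. */
int scan_brackets(double (*f)(double), double x1, double x2, int n,
                  double lo[], double hi[], int maxb)
{
    double dx, x, fp, fc;
    int i, nb = 0;

    assert(n > 0 && maxb > 0);
    dx = (x2 - x1) / n;
    fp = f(x1);
    for (i = 1; i <= n && nb < maxb; i++) {
        x = x1 + i * dx;           /* recomputed each time to limit roundoff drift */
        fc = f(x);
        if (fc * fp <= 0.0) {      /* sign change (or exact zero) in this segment */
            lo[nb] = x - dx;
            hi[nb] = x;
            nb++;
        }
        fp = fc;
    }
    return nb;
}

/* Example: a cubic with roots at 1, 2, and 3 (illustrative only). */
double cubic3(double x) { return (x - 1.0) * (x - 2.0) * (x - 3.0); }
```

Scanning cubic3 over 0 to 4 with seven segments isolates all three roots. Note that a mesh point landing exactly on a root would be reported by both neighboring segments, since the test uses <=.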

Bisection Method 

Once we know that an interval contains a root, several classical procedures are 
available to refine it. These proceed with varying degrees of speed and sureness 
towards the answer. Unfortunately, the methods that are guaranteed to converge plod 
along most slowly, while those that rush to the solution in the best cases can also dash 
rapidly to infinity without warning if measures are not taken to avoid such behavior. 

The bisection method is one that cannot fail. It is thus not to be sneered at as 
a method for otherwise badly behaved problems. The idea is simple. Over some 
interval the function is known to pass through zero because it changes sign. Evaluate 
the function at the interval’s midpoint and examine its sign. Use the midpoint to 
replace whichever limit has the same sign. After each iteration the bounds containing 
the root decrease by a factor of two. If after n iterations the root is known to 
be within an interval of size $\epsilon_n$, then after the next iteration it will be bracketed 
within an interval of size 

$$ \epsilon_{n+1} = \epsilon_n / 2 \tag{9.1.2} $$

neither more nor less. Thus, we know in advance the number of iterations required 
to achieve a given tolerance in the solution, 

$$ n = \log_2 \frac{\epsilon_0}{\epsilon} \tag{9.1.3} $$

where $\epsilon_0$ is the size of the initially bracketing interval and $\epsilon$ is the desired ending 
tolerance. 
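
The count in (9.1.3) is easy to compute by repeated halving, which sidesteps any rounding question about the logarithm. The function name bisections_needed is an assumption for this sketch.

```c
#include <assert.h>

/* Smallest integer n such that eps0 / 2^n <= eps, i.e. the number of
   bisections needed to shrink an interval of size eps0 to at most eps. */
int bisections_needed(double eps0, double eps)
{
    int n = 0;

    assert(eps0 > 0.0 && eps > 0.0);
    while (eps0 > eps) {
        eps0 *= 0.5;       /* one bisection halves the bracketing interval */
        n++;
    }
    return n;
}
```

For an initial interval of size 1 and a tolerance of 1e-6 this gives 20 steps, since 2^-19 is still larger than 1e-6 while 2^-20 is not.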

Bisection must succeed. If the interval happens to contain two or more roots, 
bisection will find one of them. If the interval contains no roots and merely straddles 
a singularity, it will converge on the singularity. 

When a method converges as a factor (less than 1) times the previous uncertainty 
to the first power (as is the case for bisection), it is said to converge linearly. Methods 
that converge as a higher power, 

$$ \epsilon_{n+1} = \text{constant} \times (\epsilon_n)^m, \qquad m > 1 \tag{9.1.4} $$

are said to converge superlinearly. In other contexts “linear” convergence would be 
termed “exponential,” or “geometrical.” That is not too bad at all: Linear convergence 
means that successive significant figures are won linearly with computational effort. 

It remains to discuss practical criteria for convergence. It is crucial to keep in 
mind that computers use a fixed number of binary digits to represent floating-point 
numbers. While your function might analytically pass through zero, it is possible that 
its computed value is never zero, for any floating-point argument. One must decide 
what accuracy on the root is attainable: Convergence to within $10^{-6}$ in absolute 
value is reasonable when the root lies near 1, but certainly unachievable if the root 
lies near $10^{26}$. One might thus think to specify convergence by a relative (fractional) 
criterion, but this becomes unworkable for roots near zero. To be most general, the 
routines below will require you to specify an absolute tolerance, such that iterations 
continue until the interval becomes smaller than this tolerance in absolute units. 
Usually you may wish to take the tolerance to be $\epsilon(|x_1| + |x_2|)/2$ where $\epsilon$ is the 
machine precision and $x_1$ and $x_2$ are the initial brackets. When the root lies near zero 
you ought to consider carefully what reasonable tolerance means for your function. 
The following routine quits after 40 bisections in any event, with $2^{-40} \approx 10^{-12}$. 







#include <math.h>

#define JMAX 40        /* Maximum allowed number of bisections. */

float rtbis(float (*func)(float), float x1, float x2, float xacc)
/* Using bisection, find the root of a function func known to lie between x1 and x2. The root,
returned as rtbis, will be refined until its accuracy is ±xacc. */
{
    void nrerror(char error_text[]);
    int j;
    float dx,f,fmid,xmid,rtb;

    f=(*func)(x1);
    fmid=(*func)(x2);
    if (f*fmid >= 0.0) nrerror("Root must be bracketed for bisection in rtbis");
    rtb = f < 0.0 ? (dx=x2-x1,x1) : (dx=x1-x2,x2);   /* Orient the search so that f>0 lies at x+dx. */
    for (j=1;j<=JMAX;j++) {
        fmid=(*func)(xmid=rtb+(dx *= 0.5));          /* Bisection loop. */
        if (fmid <= 0.0) rtb=xmid;
        if (fabs(dx) < xacc || fmid == 0.0) return rtb;
    }
    nrerror("Too many bisections in rtbis");
    return 0.0;        /* Never get here. */
}
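
The sketch below shows one way to call a bisection routine with the default tolerance suggested in the text. It is a standalone double-precision variant, not rtbis itself: the names default_tol, bisect_root, and cubic, the maxit parameter, and the use of DBL_EPSILON as the machine precision are all assumptions for this example.

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* Absolute tolerance eps*(|x1|+|x2|)/2, with eps taken as DBL_EPSILON. */
double default_tol(double x1, double x2)
{
    return DBL_EPSILON * (fabs(x1) + fabs(x2)) / 2.0;
}

/* Hypothetical double-precision bisection. Returns the midpoint of the final
   bracket; stops when the half-width drops below xacc or after maxit steps. */
double bisect_root(double (*f)(double), double x1, double x2, double xacc, int maxit)
{
    double lo = x1, hi = x2, flo = f(x1);
    int j;

    assert(flo * f(x2) < 0.0);               /* root must be bracketed */
    for (j = 0; j < maxit; j++) {
        double mid = 0.5 * (lo + hi);
        double fmid = f(mid);
        if (fmid == 0.0) return mid;
        if ((fmid < 0.0) == (flo < 0.0)) {   /* same sign as the lo end */
            lo = mid;
            flo = fmid;
        } else {
            hi = mid;
        }
        if (0.5 * fabs(hi - lo) < xacc) break;
    }
    return 0.5 * (lo + hi);
}

/* Example function with a single real root at the cube root of 2. */
double cubic(double x) { return x * x * x - 2.0; }
```

A call such as bisect_root(cubic, 1.0, 2.0, 1e-10, 100) converges in about 33 halvings. Passing default_tol(1.0, 2.0) instead asks for nearly full double precision, in which case the maxit cap is what guarantees termination.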


9.2 Secant Method, False Position Method, 
and Ridders’ Method 


For functions that are smooth near a root, the methods known respectively 
as false position (or regula falsi) and secant method generally converge faster than 
bisection. In both of these methods the function is assumed to be approximately 
linear in the local region of interest, and the next improvement in the root is taken as 
the point where the approximating line crosses the axis. After each iteration one of 
the previous boundary points is discarded in favor of the latest estimate of the root. 

The only difference between the methods is that secant retains the most recent 
of the prior estimates (Figure 9.2.1; this requires an arbitrary choice on the first 
iteration), while false position retains that prior estimate for which the function value 
has opposite sign from the function value at the current best estimate of the root, 
so that the two points continue to bracket the root (Figure 9.2.2). Mathematically, 
the secant method converges more rapidly near a root of a sufficiently continuous 
function. Its order of convergence can be shown to be the “golden ratio” 1.618..., 
so that 

$$ \lim_{k \to \infty} |\epsilon_{k+1}| \approx \text{const} \times |\epsilon_k|^{1.618} \tag{9.2.1} $$



The secant method has, however, the disadvantage that the root does not necessarily 
remain bracketed. For functions that are not sufficiently continuous, the algorithm 
can therefore not be guaranteed to converge: Local behavior might send it off 
towards infinity. 
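
The difference between the two methods comes down to which old point is discarded after each linear interpolation. A minimal sketch of a single step of each is given here; the function names and the pointer-based interface are assumptions for this illustration.

```c
#include <assert.h>
#include <math.h>

/* Abscissa where the straight line through (xa,fa) and (xb,fb) crosses zero. */
double line_zero(double xa, double fa, double xb, double fb)
{
    assert(fa != fb);
    return xb - fb * (xb - xa) / (fb - fa);
}

/* One secant update: always keep the most recent point (xb), discard xa. */
void secant_step(double (*f)(double), double *xa, double *fa, double *xb, double *fb)
{
    double xn = line_zero(*xa, *fa, *xb, *fb);
    *xa = *xb;
    *fa = *fb;
    *xb = xn;
    *fb = f(xn);
}

/* One false-position update: replace whichever end has the same sign as the
   new function value, so that [*xa,*xb] keeps bracketing the root. */
void false_position_step(double (*f)(double), double *xa, double *fa, double *xb, double *fb)
{
    double xn = line_zero(*xa, *fa, *xb, *fb);
    double fn = f(xn);
    if ((fn < 0.0) == (*fa < 0.0)) {
        *xa = xn;
        *fa = fn;
    } else {
        *xb = xn;
        *fb = fn;
    }
}

/* Example function with a root at sqrt(2) (illustrative only). */
double parabola(double x) { return x * x - 2.0; }
```

Starting from the points 1 and 2, both steps produce the same new estimate 4/3. False position keeps the pair (4/3, 2), which still brackets the root, while secant keeps (2, 4/3), the two most recent points.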

Figure 9.2.3. Example where both the secant and false position methods will take many iterations to 
arrive at the true root. This function would be difficult for many other root-finding methods. 

False position, since it sometimes keeps an older rather than newer function 
evaluation, has a lower order of convergence. Since the newer function value will 
sometimes be kept, the method is often superlinear, but estimation of its exact order 
is not so easy. 

Here are sample implementations of these two related methods. While these 
methods are standard textbook fare, Ridders’ method, described below, or Brent’s 
method, in the next section, are almost always better choices. Figure 9.2.3 shows the 
behavior of secant and false-position methods in a difficult situation. 

#include <math.h>

#define MAXIT 30       /* Set to the maximum allowed number of iterations. */

float rtflsp(float (*func)(float), float x1, float x2, float xacc)
/* Using the false position method, find the root of a function func known to lie between x1 and
x2. The root, returned as rtflsp, is refined until its accuracy is ±xacc. */
{
    void nrerror(char error_text[]);
    int j;
    float fl,fh,xl,xh,swap,dx,del,f,rtf;

    fl=(*func)(x1);
    fh=(*func)(x2);                /* Be sure the interval brackets a root. */
    if (fl*fh > 0.0) nrerror("Root must be bracketed in rtflsp");
    if (fl < 0.0) {                /* Identify the limits so that xl corresponds to the low side. */
        xl=x1;
        xh=x2;
    } else {
        xl=x2;
        xh=x1;
        swap=fl;
        fl=fh;
        fh=swap;
    }
    dx=xh-xl;
    for (j=1;j<=MAXIT;j++) {       /* False position loop. */
        rtf=xl+dx*fl/(fl-fh);      /* Increment with respect to latest value. */
        f=(*func)(rtf);
        if (f < 0.0) {             /* Replace appropriate limit. */
            del=xl-rtf;
            xl=rtf;
            fl=f;
        } else {
            del=xh-rtf;
            xh=rtf;
            fh=f;
        }
        dx=xh-xl;
        if (fabs(del) < xacc || f == 0.0) return rtf;    /* Convergence. */
    }
    nrerror("Maximum number of iterations exceeded in rtflsp");
    return 0.0;        /* Never get here. */
}


#include <math.h>

#define MAXIT 30       /* Maximum allowed number of iterations. */

float rtsec(float (*func)(float), float x1, float x2, float xacc)
/* Using the secant method, find the root of a function func thought to lie between x1 and x2.
The root, returned as rtsec, is refined until its accuracy is ±xacc. */
{
    void nrerror(char error_text[]);
    int j;
    float fl,f,dx,swap,xl,rts;

    fl=(*func)(x1);
    f=(*func)(x2);
    if (fabs(fl) < fabs(f)) {      /* Pick the bound with the smaller function value as */
        rts=x1;                    /* the most recent guess. */
        xl=x2;
        swap=fl;
        fl=f;
        f=swap;
    } else {
        xl=x1;
        rts=x2;
    }
    for (j=1;j<=MAXIT;j++) {       /* Secant loop. */
        dx=(xl-rts)*f/(f-fl);      /* Increment with respect to latest value. */
        xl=rts;
        fl=f;
        rts += dx;
        f=(*func)(rts);
        if (fabs(dx) < xacc || f == 0.0) return rts;     /* Convergence. */
    }
    nrerror("Maximum number of iterations exceeded in rtsec");
    return 0.0;        /* Never get here. */
}

Ridders’ Method 

A powerful variant on false position is due to Ridders [1]. When a root is 
bracketed between $x_1$ and $x_2$, Ridders’ method first evaluates the function at the 
midpoint $x_3 = (x_1 + x_2)/2$. It then factors out that unique exponential function 
which turns the residual function into a straight line. Specifically, it solves for a 
factor $e^Q$ that gives 

$$ f(x_1) - 2 f(x_3) e^Q + f(x_2) e^{2Q} = 0 \tag{9.2.2} $$

This is a quadratic equation in $e^Q$, which can be solved to give 

$$ e^Q = \frac{f(x_3) + \mathrm{sign}[f(x_2)] \sqrt{f(x_3)^2 - f(x_1) f(x_2)}}{f(x_2)} \tag{9.2.3} $$

Now the false position method is applied, not to the values $f(x_1), f(x_3), f(x_2)$, but 
to the values $f(x_1), f(x_3) e^Q, f(x_2) e^{2Q}$, yielding a new guess for the root, $x_4$. The 
overall updating formula (incorporating the solution 9.2.3) is 

$$ x_4 = x_3 + (x_3 - x_1) \frac{\mathrm{sign}[f(x_1) - f(x_2)]\, f(x_3)}{\sqrt{f(x_3)^2 - f(x_1) f(x_2)}} \tag{9.2.4} $$

Equation (9.2.4) has some very nice properties. First, $x_4$ is guaranteed to lie 
in the interval $(x_1, x_2)$, so the method never jumps out of its brackets. Second, 
the convergence of successive applications of equation (9.2.4) is quadratic, that is, 
m = 2 in equation (9.1.4). Since each application of (9.2.4) requires two function 
evaluations, the actual order of the method is $\sqrt{2}$, not 2; but this is still quite 
respectably superlinear: the number of significant digits in the answer approximately 
doubles with each two function evaluations. Third, taking out the function’s “bend” 
via exponential (that is, ratio) factors, rather than via a polynomial technique (e.g., 
fitting a parabola), turns out to give an extraordinarily robust algorithm. In both 
reliability and speed, Ridders’ method is generally competitive with the more highly 
developed and better established (but more complicated) method of Van Wijngaarden, 
Dekker, and Brent, which we next discuss. 

#include <math.h>
#include "nrutil.h"

#define MAXIT 60
#define UNUSED (-1.11e30)

float zriddr(float (*func)(float), float x1, float x2, float xacc)
/* Using Ridders' method, return the root of a function func known to lie between x1 and x2.
The root, returned as zriddr, will be refined to an approximate accuracy xacc. */
{
    int j;
    float ans,fh,fl,fm,fnew,s,xh,xl,xm,xnew;

    fl=(*func)(x1);
    fh=(*func)(x2);
    if ((fl > 0.0 && fh < 0.0) || (fl < 0.0 && fh > 0.0)) {
        xl=x1;
        xh=x2;
        ans=UNUSED;                /* Any highly unlikely value, to simplify logic below. */
        for (j=1;j<=MAXIT;j++) {
            xm=0.5*(xl+xh);
            fm=(*func)(xm);        /* First of two function evaluations per iteration. */
            s=sqrt(fm*fm-fl*fh);
            if (s == 0.0) return ans;
            xnew=xm+(xm-xl)*((fl >= fh ? 1.0 : -1.0)*fm/s);    /* Updating formula. */
            if (fabs(xnew-ans) <= xacc) return ans;
            ans=xnew;
            fnew=(*func)(ans);     /* Second of two function evaluations per iteration. */
            if (fnew == 0.0) return ans;
            if (SIGN(fm,fnew) != fm) {     /* Bookkeeping to keep the root bracketed */
                xl=xm;                     /* on next iteration. */
                fl=fm;
                xh=ans;
                fh=fnew;
            } else if (SIGN(fl,fnew) != fl) {
                xh=ans;
                fh=fnew;
            } else if (SIGN(fh,fnew) != fh) {
                xl=ans;
                fl=fnew;
            } else nrerror("never get here.");
            if (fabs(xh-xl) <= xacc) return ans;
        }
        nrerror("zriddr exceed maximum iterations");
    }
    else {
        if (fl == 0.0) return x1;
        if (fh == 0.0) return x2;
        nrerror("root must be bracketed in zriddr.");
    }
    return 0.0;        /* Never get here. */
}


CITED REFERENCES AND FURTHER READING: 

Ralston, A., and Rabinowitz, P. 1978, A First Course in Numerical Analysis, 2nd ed. (New York: 
McGraw-Hill), §8.3. 

Ostrowski, A.M. 1966, Solutions of Equations and Systems of Equations, 2nd ed. (New York: 
Academic Press), Chapter 12. 

Ridders, C.J.F. 1979, IEEE Transactions on Circuits and Systems, vol. CAS-26, pp. 979-980. [1] 


9.3 Van Wijngaarden-Dekker-Brent Method 

While secant and false position formally converge faster than bisection, one 
finds in practice pathological functions for which bisection converges more rapidly. 
These can be choppy, discontinuous functions, or even smooth functions if the 
second derivative changes sharply near the root. Bisection always halves the interval, 
while secant and false position can sometimes spend many cycles slowly pulling 
distant bounds closer to a root. Ridders’ method does a much better job, but it 
too can sometimes be fooled. Is there a way to combine superlinear convergence 
with the sureness of bisection? 

Yes. We can keep track of whether a supposedly superlinear method is actually 
converging the way it is supposed to, and, if it is not, we can intersperse bisection 
steps so as to guarantee at least linear convergence. This kind of super-strategy 
requires attention to bookkeeping detail, and also careful consideration of how 
roundoff errors can affect the guiding strategy. Also, we must be able to determine 
reliably when convergence has been achieved. 

An excellent algorithm that pays close attention to these matters was developed 
in the 1960s by van Wijngaarden, Dekker, and others at the Mathematical Center 
in Amsterdam, and later improved by Brent [1]. For brevity, we refer to the final 
form of the algorithm as Brent’s method. The method is guaranteed (by Brent) 
to converge, so long as the function can be evaluated within the initial interval 
known to contain a root. 

Brent’s method combines root bracketing, bisection, and inverse quadratic 
interpolation to converge from the neighborhood of a zero crossing. While the false 
position and secant methods assume approximately linear behavior between two 
prior root estimates, inverse quadratic interpolation uses three prior points to fit an 
inverse quadratic function (x as a quadratic function of y) whose value at y = 0 is 
taken as the next estimate of the root x. Of course one must have contingency plans 
for what to do if the root falls outside of the brackets. Brent’s method takes care of 
all that. If the three point pairs are [a, f(a)], [b, f(b)], [c, f(c)] then the interpolation 
formula (cf. equation 3.1.1) is 

$$ x = \frac{[y - f(a)][y - f(b)]\,c}{[f(c) - f(a)][f(c) - f(b)]} + \frac{[y - f(b)][y - f(c)]\,a}{[f(a) - f(b)][f(a) - f(c)]} + \frac{[y - f(c)][y - f(a)]\,b}{[f(b) - f(c)][f(b) - f(a)]} \tag{9.3.1} $$

Setting y to zero gives a result for the next root estimate, which can be written as 

$$ x = b + P/Q \tag{9.3.2} $$

where, in terms of 

$$ R \equiv f(b)/f(c), \qquad S \equiv f(b)/f(a), \qquad T \equiv f(a)/f(c) \tag{9.3.3} $$

we have 

$$ P = S\,[T(R - T)(c - b) - (1 - R)(b - a)] \tag{9.3.4} $$

$$ Q = (T - 1)(R - 1)(S - 1) \tag{9.3.5} $$

In practice b is the current best estimate of the root and P/Q ought to be a “small” 
correction. Quadratic methods work well only when the function behaves smoothly; 
they run the serious risk of giving very bad estimates of the next root or causing 
machine failure by an inappropriate division by a very small number ($Q \approx 0$). 
Brent’s method guards against this problem by maintaining brackets on the root 
and checking where the interpolation would land before carrying out the division. 
When the correction P/Q would not land within the bounds, or when the bounds 
are not collapsing rapidly enough, the algorithm takes a bisection step. Thus, 
Brent’s method combines the sureness of bisection with the speed of a higher-order 
method when appropriate. We recommend it as the method of choice for general 
one-dimensional root finding where a function’s values only (and not its derivative 
or functional form) are available. 
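
A single inverse quadratic interpolation step, without any of the safeguards, can be sketched as follows. The function name is an assumption; the ratios R = f(b)/f(c), S = f(b)/f(a), T = f(a)/f(c) correspond to the quantities r, s, and q computed in zbrent.

```c
#include <assert.h>
#include <math.h>

/* One unguarded inverse quadratic interpolation step through (a,fa), (b,fb),
   (c,fc), with b the current best estimate. Returns b + P/Q, using P and Q
   from equations (9.3.4) and (9.3.5). */
double inverse_quadratic_step(double a, double fa, double b, double fb,
                              double c, double fc)
{
    double r, s, t, p, q;

    /* The three function values must be distinct and fa, fc nonzero;
       a real implementation would fall back to bisection instead. */
    assert(fa != 0.0 && fc != 0.0);
    assert(fa != fb && fb != fc && fa != fc);
    r = fb / fc;
    s = fb / fa;
    t = fa / fc;
    p = s * (t * (r - t) * (c - b) - (1.0 - r) * (b - a));
    q = (t - 1.0) * (r - 1.0) * (s - 1.0);
    return b + p / q;
}
```

The step is exact whenever x is a quadratic (or linear) function of y. For instance, the points (0,-1), (6,1), (12,2) lie on x = y^2 + 3y + 2, and the step returns x = 2, the value at y = 0.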

#include <math.h>
#include "nrutil.h"

#define ITMAX 100      /* Maximum allowed number of iterations. */
#define EPS 3.0e-8     /* Machine floating-point precision. */

float zbrent(float (*func)(float), float x1, float x2, float tol)
/* Using Brent's method, find the root of a function func known to lie between x1 and x2. The
root, returned as zbrent, will be refined until its accuracy is tol. */
{
    int iter;
    float a=x1,b=x2,c=x2,d,e,min1,min2;
    float fa=(*func)(a),fb=(*func)(b),fc,p,q,r,s,tol1,xm;

    if ((fa > 0.0 && fb > 0.0) || (fa < 0.0 && fb < 0.0))
        nrerror("Root must be bracketed in zbrent");
    fc=fb;
    for (iter=1;iter<=ITMAX;iter++) {
        if ((fb > 0.0 && fc > 0.0) || (fb < 0.0 && fc < 0.0)) {
            c=a;                   /* Rename a, b, c and adjust bounding interval d. */
            fc=fa;
            e=d=b-a;
        }
        if (fabs(fc) < fabs(fb)) {
            a=b;
            b=c;
            c=a;
            fa=fb;
            fb=fc;
            fc=fa;
        }
        tol1=2.0*EPS*fabs(b)+0.5*tol;      /* Convergence check. */
        xm=0.5*(c-b);
        if (fabs(xm) <= tol1 || fb == 0.0) return b;
        if (fabs(e) >= tol1 && fabs(fa) > fabs(fb)) {
            s=fb/fa;               /* Attempt inverse quadratic interpolation. */
            if (a == c) {
                p=2.0*xm*s;
                q=1.0-s;
            } else {
                q=fa/fc;
                r=fb/fc;
                p=s*(2.0*xm*q*(q-r)-(b-a)*(r-1.0));
                q=(q-1.0)*(r-1.0)*(s-1.0);
            }
            if (p > 0.0) q = -q;   /* Check whether in bounds. */
            p=fabs(p);
            min1=3.0*xm*q-fabs(tol1*q);
            min2=fabs(e*q);
            if (2.0*p < (min1 < min2 ? min1 : min2)) {
                e=d;               /* Accept interpolation. */
                d=p/q;
            } else {
                d=xm;              /* Interpolation failed, use bisection. */
                e=d;
            }
        } else {                   /* Bounds decreasing too slowly, use bisection. */
            d=xm;
            e=d;
        }
        a=b;                       /* Move last best guess to a. */
        fa=fb;
        if (fabs(d) > tol1)        /* Evaluate new trial root. */
            b += d;
        else
            b += SIGN(tol1,xm);
        fb=(*func)(b);
    }
    nrerror("Maximum number of iterations exceeded in zbrent");
    return 0.0;        /* Never get here. */
}



CITED REFERENCES AND FURTHER READING: 

Brent, R.P. 1973, Algorithms for Minimization without Derivatives (Englewood Cliffs, NJ: Prentice-Hall), 
Chapters 3, 4. [1] 

Forsythe, G.E., Malcolm, M.A., and Moler, C.B. 1977, Computer Methods for Mathematical 
Computations (Englewood Cliffs, NJ: Prentice-Hall), §7.2. 


9.4 Newton-Raphson Method Using Derivative 



Perhaps the most celebrated of all one-dimensional root-finding routines is Newton’s 
method, also called the Newton-Raphson method. This method is distinguished 
from the methods of previous sections by the fact that it requires the evaluation 
of both the function f(x), and the derivative f'(x), at arbitrary points x. The 
Newton-Raphson formula consists geometrically of extending the tangent line at a 
current point $x_i$ until it crosses zero, then setting the next guess $x_{i+1}$ to the abscissa 
of that zero-crossing (see Figure 9.4.1). Algebraically, the method derives from the 
familiar Taylor series expansion of a function in the neighborhood of a point, 

$$ f(x + \delta) \approx f(x) + f'(x)\,\delta + \frac{f''(x)}{2}\,\delta^2 + \cdots \tag{9.4.1} $$

For small enough values of $\delta$, and for well-behaved functions, the terms beyond 
linear are unimportant, hence $f(x + \delta) = 0$ implies 

$$ \delta = -\frac{f(x)}{f'(x)} \tag{9.4.2} $$


Newton-Raphson is not restricted to one dimension. The method readily 
generalizes to multiple dimensions, as we shall see in §9.6 and §9.7, below. 

Far from a root, where the higher-order terms in the series are important, the
Newton-Raphson formula can give grossly inaccurate, meaningless corrections. For
instance, the initial guess for the root might be so far from the true root as to let
the search interval include a local maximum or minimum of the function. This can
be death to the method (see Figure 9.4.2). If an iteration places a trial guess near
such a local extremum, so that the first derivative nearly vanishes, then
Newton-Raphson sends its solution off to limbo, with vanishingly small hope of recovery.
Like most powerful tools, Newton-Raphson can be destructive when used in inappropriate
circumstances. Figure 9.4.3 demonstrates another possible pathology.

Figure 9.4.3. Unfortunate case where Newton’s method enters a nonconvergent cycle. This behavior
is often encountered when the function f is obtained, in whole or in part, by table interpolation. With
a better initial guess, the method would have succeeded.

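Why do we call Newton-Raphson powerful? The answer lies in its rate of
convergence: Within a small distance $\epsilon$ of $x$ the function and its derivative are
approximately:

$$f(x+\epsilon) = f(x) + \epsilon f'(x) + \epsilon^2\,\frac{f''(x)}{2} + \cdots,$$
$$f'(x+\epsilon) = f'(x) + \epsilon f''(x) + \cdots \qquad (9.4.3)$$

By the Newton-Raphson formula,

$$x_{i+1} = x_i - \frac{f(x_i)}{f'(x_i)}, \qquad (9.4.4)$$

so that

$$\epsilon_{i+1} = \epsilon_i - \frac{f(x_i)}{f'(x_i)}. \qquad (9.4.5)$$

When a trial solution $x_i$ differs from the true root by $\epsilon_i$, we can use (9.4.3) to express
$f(x_i)$, $f'(x_i)$ in (9.4.4) in terms of $\epsilon_i$ and derivatives at the root itself. The result is
a recurrence relation for the deviations of the trial solutions

$$\epsilon_{i+1} = \epsilon_i^2\,\frac{f''(x)}{2f'(x)}. \qquad (9.4.6)$$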

Sample page from NUMERICAL RECIPES IN C: THE ART OF SCIENTIFIC COMPUTING (ISBN 0-521 -43108-5 





Equation (9.4.6) says that Newton-Raphson converges quadratically (cf. equation
9.2.3). Near a root, the number of significant digits approximately doubles
with each step. This very strong convergence property makes Newton-Raphson the 
method of choice for any function whose derivative can be evaluated efficiently, and 
whose derivative is continuous and nonzero in the neighborhood of a root. 

Even where Newton-Raphson is rejected for the early stages of convergence 
(because of its poor global convergence properties), it is very common to “polish 
up” a root with one or two steps of Newton-Raphson, which can multiply by two 
or four its number of significant figures! 
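
The doubling of significant digits is easy to observe numerically. The following is a
minimal sketch, not one of the book's routines: the function name newton_sqrt2_errors and
the test function f(x) = x*x - 2 are assumptions chosen for illustration. It records the
error after each Newton-Raphson step so that successive errors can be compared.

```c
#include <math.h>

/* Illustrative sketch (hypothetical helper, not a book routine): apply
   Newton-Raphson to f(x) = x*x - 2 starting from x0, and store the error
   |x_i - sqrt(2)| after each of nsteps steps in err[0..nsteps-1].
   Returns the final iterate. */
double newton_sqrt2_errors(double x0, int nsteps, double err[])
{
    double x = x0;
    int i;

    for (i = 0; i < nsteps; i++) {
        double f = x*x - 2.0;           /* function value */
        double df = 2.0*x;              /* derivative */
        x -= f/df;                      /* correction of equation (9.4.2) */
        err[i] = fabs(x - sqrt(2.0));
    }
    return x;
}
```

Starting from x0 = 1, each stored error is smaller than the square of the one before it,
which is the behavior that equation (9.4.6) predicts near the root.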

For an efficient realization of Newton-Raphson the user provides a routine that
evaluates both $f(x)$ and its first derivative $f'(x)$ at the point $x$. The Newton-Raphson
formula can also be applied using a numerical difference to approximate the true
local derivative,

$$f'(x) \approx \frac{f(x+dx) - f(x)}{dx}. \qquad (9.4.7)$$

This is not, however, a recommended procedure for the following reasons: (i) You
are doing two function evaluations per step, so at best the superlinear order of
convergence will be only $\sqrt{2}$. (ii) If you take $dx$ too small you will be wiped out
by roundoff, while if you take it too large your order of convergence will be only
linear, no better than using the initial evaluation $f'(x_0)$ for all subsequent steps.
Therefore, Newton-Raphson with numerical derivatives is (in one dimension) always
dominated by the secant method of §9.2. (In multidimensions, where there is a
paucity of available methods, Newton-Raphson with numerical derivatives must be
taken more seriously. See §§9.6-9.7.)
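
For concreteness, the finite-difference variant just described can be sketched as
follows. This is an illustration of the procedure being warned against, not a recommended
routine; the names newton_fd and example_cubic are hypothetical, and double precision is
our choice rather than the book's.

```c
#include <math.h>

/* Illustrative sketch: Newton-Raphson with the derivative replaced by the
   forward difference of equation (9.4.7). Each step costs two evaluations
   of func. Returns the last iterate; *nsteps receives the steps taken. */
double newton_fd(double (*func)(double), double x, double dx,
    double xacc, int maxit, int *nsteps)
{
    int j;

    for (j = 1; j <= maxit; j++) {
        double f = (*func)(x);
        double df = ((*func)(x + dx) - f)/dx;   /* equation (9.4.7) */
        double step;
        if (df == 0.0) break;                   /* cannot divide; give up */
        step = f/df;
        x -= step;
        if (fabs(step) < xacc) break;           /* converged */
    }
    *nsteps = (j > maxit ? maxit : j);
    return x;
}

/* Hypothetical test function: x^3 - 2x - 5, with a real root near 2.0946. */
double example_cubic(double x)
{
    return x*x*x - 2.0*x - 5.0;
}
```

Calling newton_fd(example_cubic, 2.0, 1.0e-6, 1.0e-10, 50, &n) converges to the root in a
handful of steps, but at twice the function-evaluation cost per step of a secant iteration.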

The following function calls a user supplied function funcd(x,fn,df) which 
supplies the function value as fn and the derivative as df. We have included input 
bounds on the root simply to be consistent with previous root-finding routines: 
Newton does not adjust bounds, and works only on local information at the point 
x. The bounds are used only to pick the midpoint as the first guess, and to reject 
the solution if it wanders outside of the bounds. 


#include <math.h>
#define JMAX 20                         /* Set to maximum number of iterations. */

float rtnewt(void (*funcd)(float, float *, float *), float x1, float x2,
    float xacc)
/* Using the Newton-Raphson method, find the root of a function known to lie in the interval
[x1,x2]. The root rtnewt will be refined until its accuracy is known within ±xacc. funcd
is a user-supplied routine that returns both the function value and the first derivative of the
function at the point x. */
{
    void nrerror(char error_text[]);
    int j;
    float df,dx,f,rtn;

    rtn=0.5*(x1+x2);                    /* Initial guess. */
    for (j=1;j<=JMAX;j++) {
        (*funcd)(rtn,&f,&df);
        dx=f/df;
        rtn -= dx;
        if ((x1-rtn)*(rtn-x2) < 0.0)
            nrerror("Jumped out of brackets in rtnewt");
        if (fabs(dx) < xacc) return rtn;        /* Convergence. */
    }
    nrerror("Maximum number of iterations exceeded in rtnewt");
    return 0.0;                         /* Never get here. */
}


While Newton-Raphson’s global convergence properties are poor, it is fairly 
easy to design a fail-safe routine that utilizes a combination of bisection and Newton- 
Raphson. The hybrid algorithm takes a bisection step whenever Newton-Raphson 
would take the solution out of bounds, or whenever Newton-Raphson is not reducing 
the size of the brackets rapidly enough. 


#include <math.h>
#define MAXIT 100                       /* Maximum allowed number of iterations. */

float rtsafe(void (*funcd)(float, float *, float *), float x1, float x2,
    float xacc)
/* Using a combination of Newton-Raphson and bisection, find the root of a function bracketed
between x1 and x2. The root, returned as the function value rtsafe, will be refined until
its accuracy is known within ±xacc. funcd is a user-supplied routine that returns both the
function value and the first derivative of the function. */
{
    void nrerror(char error_text[]);
    int j;
    float df,dx,dxold,f,fh,fl;
    float temp,xh,xl,rts;

    (*funcd)(x1,&fl,&df);
    (*funcd)(x2,&fh,&df);
    if ((fl > 0.0 && fh > 0.0) || (fl < 0.0 && fh < 0.0))
        nrerror("Root must be bracketed in rtsafe");
    if (fl == 0.0) return x1;
    if (fh == 0.0) return x2;
    if (fl < 0.0) {                     /* Orient the search so that f(xl) < 0. */
        xl=x1;
        xh=x2;
    } else {
        xh=x1;
        xl=x2;
    }
    rts=0.5*(x1+x2);                    /* Initialize the guess for root, */
    dxold=fabs(x2-x1);                  /* the "stepsize before last," */
    dx=dxold;                           /* and the last step. */
    (*funcd)(rts,&f,&df);
    for (j=1;j<=MAXIT;j++) {            /* Loop over allowed iterations. */
        if ((((rts-xh)*df-f)*((rts-xl)*df-f) > 0.0)     /* Bisect if Newton out of range, */
            || (fabs(2.0*f) > fabs(dxold*df))) {        /* or not decreasing fast enough. */
            dxold=dx;
            dx=0.5*(xh-xl);
            rts=xl+dx;
            if (xl == rts) return rts;  /* Change in root is negligible. */
        } else {                        /* Newton step acceptable. Take it. */
            dxold=dx;
            dx=f/df;
            temp=rts;
            rts -= dx;
            if (temp == rts) return rts;
        }
        if (fabs(dx) < xacc) return rts;        /* Convergence criterion. */
        (*funcd)(rts,&f,&df);           /* The one new function evaluation per iteration. */
        if (f < 0.0)                    /* Maintain the bracket on the root. */
            xl=rts;
        else
            xh=rts;
    }
    nrerror("Maximum number of iterations exceeded in rtsafe");
    return 0.0;                         /* Never get here. */
}


For many functions the derivative f'(x) often converges to machine accuracy 
before the function f(x) itself does. When that is the case one need not subsequently 
update f'(x). This shortcut is recommended only when you confidently understand 
the generic behavior of your function, but it speeds computations when the derivative 
calculation is laborious. (Formally this makes the convergence only linear, but if the 
derivative isn’t changing anyway, you can do no better.) 

Newton-Raphson and Fractals 

An interesting sidelight to our repeated warnings about Newton-Raphson’s
unpredictable global convergence properties (its very rapid local convergence
notwithstanding) is to investigate, for some particular equation, the set of starting
values from which the method does, or doesn’t, converge to a root.

Consider the simple equation

$$z^3 - 1 = 0 \qquad (9.4.8)$$

whose single real root is $z = 1$, but which also has complex roots at the other two
cube roots of unity, $\exp(\pm 2\pi i/3)$. Newton’s method gives the iteration

$$z_{j+1} = z_j - \frac{z_j^3 - 1}{3z_j^2}. \qquad (9.4.9)$$

Up to now, we have applied an iteration like equation (9.4.9) only for real
starting values $z_0$, but in fact all of the equations in this section also apply in the
complex plane. We can therefore map out the complex plane into regions from which
a starting value $z_0$, iterated in equation (9.4.9), will, or won’t, converge to $z = 1$.
Naively, we might expect to find a “basin of convergence” somehow surrounding
the root $z = 1$. We surely do not expect the basin of convergence to fill the whole
plane, because the plane must also contain regions that converge to each of the two
complex roots. In fact, by symmetry, the three regions must have identical shapes.
Perhaps they will be three symmetric 120° wedges, with one root centered in each?
Now take a look at Figure 9.4.4, which shows the result of a numerical
exploration. The basin of convergence does indeed cover 1/3 the area of the complex
plane, but its boundary is highly irregular; in fact, it is fractal. (A fractal, so called,
has self-similar structure that repeats on all scales of magnification.) How does this
fractal emerge from something as simple as Newton’s method, and an equation as
simple as (9.4.8)? The answer is already implicit in Figure 9.4.2, which showed how,
on the real line, a local extremum causes Newton’s method to shoot off to infinity.
Suppose one is slightly removed from such a point. Then one might be shot off
not to infinity but, by luck, right into the basin of convergence of the desired
root. But that means that in the neighborhood of an extremum there must be a tiny,
perhaps distorted, copy of the basin of convergence, a kind of “one-bounce away”
copy. Similar logic shows that there can be “two-bounce” copies, “three-bounce”
copies, and so on. A fractal thus emerges.

Figure 9.4.4. The complex z plane with real and imaginary components in the range (−2, 2). The
black region is the set of points from which Newton’s method converges to the root z = 1 of the equation
$z^3 - 1 = 0$. Its shape is fractal.

Notice that, for equation (9.4.8), almost the whole real axis is in the domain of
convergence for the root z = 1. We say “almost” because of the peculiar discrete
points on the negative real axis whose convergence is indeterminate (see figure).
What happens if you start Newton’s method from one of these points? (Try it.)
9.5 Roots of Polynomials

Here we present a few methods for finding roots of polynomials. These will 
serve for most practical problems involving polynomials of low-to-moderate degree 
or for well-conditioned polynomials of higher degree. Not as well appreciated as it 
ought to be is the fact that some polynomials are exceedingly ill-conditioned. The 
tiniest changes in a polynomial’s coefficients can, in the worst case, send its roots 
sprawling all over the complex plane. (An infamous example due to Wilkinson is 
detailed by Acton [1].) 

Recall that a polynomial of degree n will have n roots. The roots can be real
or complex, and they might not be distinct. If the coefficients of the polynomial are
real, then complex roots will occur in pairs that are conjugate, i.e., if $x_1 = a + bi$
is a root then $x_2 = a - bi$ will also be a root. When the coefficients are complex,
the complex roots need not be related.

Multiple roots, or closely spaced roots, produce the most difficulty for numerical
algorithms (see Figure 9.5.1). For example, $P(x) = (x - a)^2$ has a double real root
at x = a. However, we cannot bracket the root by the usual technique of identifying
neighborhoods where the function changes sign, nor will slope-following methods
such as Newton-Raphson work well, because both the function and its derivative
vanish at a multiple root. Newton-Raphson may work, but slowly, since large
roundoff errors can occur. When a root is known in advance to be multiple, then
special methods of attack are readily devised. Problems arise when (as is generally
the case) we do not know in advance what pathology a root will display.
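
The slow convergence at a double root can be seen directly: for $P(x) = (x - a)^2$ the
Newton-Raphson correction is $P/P' = (x - a)/2$, so each step merely halves the error
instead of squaring it. The following sketch demonstrates this; the function name
newton_double_root is hypothetical.

```c
#include <math.h>

/* Illustrative sketch: apply nsteps Newton-Raphson steps to
   P(x) = (x - a)^2 starting from x0, and return the final iterate. */
double newton_double_root(double a, double x0, int nsteps)
{
    double x = x0;
    int i;

    for (i = 0; i < nsteps; i++) {
        double p = (x - a)*(x - a);     /* polynomial value */
        double dp = 2.0*(x - a);        /* derivative, vanishes at the root */
        if (dp == 0.0) break;           /* landed exactly on the root */
        x -= p/dp;                      /* error is halved, not squared */
    }
    return x;
}
```

Ten steps from x0 = 3 toward the double root a = 1 reduce the error only from 2 to
2/1024, whereas quadratic convergence at a simple root would have reached machine
precision in fewer steps.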

Deflation of Polynomials 

When seeking several or all roots of a polynomial, the total effort can be 
significantly reduced by the use of deflation. As each root r is found, the polynomial 
is factored into a product involving the root and a reduced polynomial of degree 
one less than the original, i.e., $P(x) = (x - r)Q(x)$. Since the roots of Q are
exactly the remaining roots of P, the effort of finding additional roots decreases, 
because we work with polynomials of lower and lower degree as we find successive 
roots. Even more important, with deflation we can avoid the blunder of having our 
iterative method converge twice to the same (nonmultiple) root instead of separately 
to two different roots. 

Deflation, which amounts to synthetic division, is a simple operation that acts 
on the array of polynomial coefficients. The concise code for synthetic division by a 
monomial factor was given in §5.3 above. You can deflate complex roots either by 
converting that code to complex data type, or else — in the case of a polynomial with 
real coefficients but possibly complex roots — by deflating by a quadratic factor, 

$$[x - (a + ib)]\,[x - (a - ib)] = x^2 - 2ax + (a^2 + b^2) \qquad (9.5.1)$$

The routine poldiv in §5.3 can be used to divide the polynomial by this factor. 
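
As a reminder of what deflation by a single real root looks like in code, here is a
minimal sketch of synthetic division by the monomial factor (x - r), written in double
precision. The name deflate_forward and its calling convention are assumptions for
illustration, not the interface of the §5.3 code.

```c
/* Illustrative sketch of forward deflation: divide
   P(x) = a[0] + a[1] x + ... + a[n] x^n by (x - r).
   On return a[0..n-1] holds the coefficients of the quotient Q(x), a[n] is
   set to zero, and the return value is the remainder, equal to P(r). */
double deflate_forward(double a[], int n, double r)
{
    double rem = a[n];
    int i;

    a[n] = 0.0;
    for (i = n - 1; i >= 0; i--) {      /* from highest power downward */
        double swap = a[i];
        a[i] = rem;
        rem = swap + rem*r;
    }
    return rem;
}
```

For $P(x) = x^2 - 3x + 2$ and r = 1 this leaves Q(x) = x - 2 in the array and returns a
remainder of zero, confirming that r is a root.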

Deflation must, however, be utilized with care. Because each new root is known
with only finite accuracy, errors creep into the determination of the coefficients of
the successively deflated polynomial. Consequently, the roots can become more and
more inaccurate. It matters a lot whether the inaccuracy creeps in stably (plus or
minus a few multiples of the machine precision at each stage) or unstably (erosion of
successive significant figures until the results become meaningless). Which behavior
occurs depends on just how the root is divided out. Forward deflation, where the
new polynomial coefficients are computed in the order from the highest power of x
down to the constant term, was illustrated in §5.3. This turns out to be stable if the
root of smallest absolute value is divided out at each stage. Alternatively, one can do
backward deflation, where new coefficients are computed in order from the constant
term up to the coefficient of the highest power of x. This is stable if the remaining
root of largest absolute value is divided out at each stage.

Figure 9.5.1. (a) Linear, quadratic, and cubic behavior at the roots of polynomials. Only under high
magnification (b) does it become apparent that the cubic has one, not three, roots, and that the quadratic
has two roots rather than none.

A polynomial whose coefficients are interchanged “end-to-end,” so that the
constant becomes the highest coefficient, etc., has its roots mapped into their
reciprocals. (Proof: Divide the whole polynomial by its highest power $x^n$ and
rewrite it as a polynomial in $1/x$.) The algorithm for backward deflation is therefore
virtually identical to that of forward deflation, except that the original coefficients are
taken in reverse order and the reciprocal of the deflating root is used. Since we will
use forward deflation below, we leave to you the exercise of writing a concise coding
for backward deflation (as in §5.3). For more on the stability of deflation, consult [2].
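
One possible solution to that exercise is sketched below. It is our own formulation, not
the book's: instead of literally reversing the coefficient array, it matches coefficients
of $P(x) = (x - r)Q(x)$ from the constant term upward, which gives $q_0 = -a_0/r$ and
$q_i = (q_{i-1} - a_i)/r$. The name deflate_backward and the separate output array are
assumptions.

```c
/* Illustrative sketch of backward deflation: given
   P(x) = a[0] + ... + a[n] x^n and a nonzero root r, compute the quotient
   Q(x) = q[0] + ... + q[n-1] x^(n-1) with P(x) = (x - r) Q(x), working from
   the constant term upward. Returns the mismatch a[n] - q[n-1], which is
   zero (up to roundoff) when r is a root of P. */
double deflate_backward(const double a[], int n, double r, double q[])
{
    int i;

    q[0] = -a[0]/r;                     /* constant term: -r q0 = a0 */
    for (i = 1; i < n; i++)
        q[i] = (q[i-1] - a[i])/r;       /* x^i term: q(i-1) - r q(i) = a(i) */
    return a[n] - q[n-1];               /* x^n term should match exactly */
}
```

For $P(x) = x^2 - 3x + 2$ and the larger root r = 2, the routine yields Q(x) = x - 1 with
zero mismatch.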

To minimize the impact of increasing errors (even stable ones) when using 
deflation, it is advisable to treat roots of the successively deflated polynomials as 
only tentative roots of the original polynomial. One then polishes these tentative roots 
by taking them as initial guesses that are to be re-solved for, using the nondeflated 
original polynomial P. Again you must beware lest two deflated roots are inaccurate 
enough that, under polishing, they both converge to the same undeflated root; in that 
case you gain a spurious root-multiplicity and lose a distinct root. This is detectable, 
since you can compare each polished root for equality to previous ones from distinct 
tentative roots. When it happens, you are advised to deflate the polynomial just 
once (and for this root only), then again polish the tentative root, or to use Maehly’s 
procedure (see equation 9.5.29 below). 

Below we say more about techniques for polishing real and complex-conjugate
tentative roots. First, let’s get back to overall strategy.

There are two schools of thought about how to proceed when faced with a 
polynomial of real coefficients. One school says to go after the easiest quarry, the 
real, distinct roots, by the same kinds of methods that we have discussed in previous 
sections for general functions, i.e., trial-and-error bracketing followed by a safe 
Newton-Raphson as in rtsafe. Sometimes you are only interested in real roots, in
which case the strategy is complete. Otherwise, you then go after quadratic factors 
of the form (9.5.1) by any of a variety of methods. One such is Bairstow’s method, 
which we will discuss below in the context of root polishing. Another is Muller’s 
method, which we here briefly discuss. 

Muller’s Method 

Muller’s method generalizes the secant method, but uses quadratic interpolation
among three points instead of linear interpolation between two. Solving for the
zeros of the quadratic allows the method to find complex pairs of roots. Given three
previous guesses for the root $x_{i-2}$, $x_{i-1}$, $x_i$, and the values of the polynomial $P(x)$
at those points, the next approximation $x_{i+1}$ is produced by the following formulas,

$$q \equiv \frac{x_i - x_{i-1}}{x_{i-1} - x_{i-2}}$$
$$A \equiv qP(x_i) - q(1+q)P(x_{i-1}) + q^2P(x_{i-2})$$
$$B \equiv (2q+1)P(x_i) - (1+q)^2P(x_{i-1}) + q^2P(x_{i-2})$$
$$C \equiv (1+q)P(x_i) \qquad (9.5.2)$$

followed by

$$x_{i+1} = x_i - (x_i - x_{i-1})\left[\frac{2C}{B \pm \sqrt{B^2 - 4AC}}\right] \qquad (9.5.3)$$

where the sign in the denominator is chosen to make its absolute value or modulus
as large as possible. You can start the iterations with any three values of x that you
like, e.g., three equally spaced values on the real axis. Note that you must allow
for the possibility of a complex denominator, and subsequent complex arithmetic,
in implementing the method.

Muller’s method is sometimes also used for finding complex zeros of analytic
functions (not just polynomials) in the complex plane, for example in the IMSL
routine ZANLY [3].


Laguerre’s Method 



The second school regarding overall strategy happens to be the one to which we
belong. That school advises you to use one of a very small number of methods that
will converge (though with greater or lesser efficiency) to all types of roots: real,
complex, single, or multiple. Use such a method to get tentative values for all n
roots of your nth degree polynomial. Then go back and polish them as you desire.

Laguerre’s method is by far the most straightforward of these general, complex
methods. It does require complex arithmetic, even while converging to real roots; 
however, for polynomials with all real roots, it is guaranteed to converge to a 
root from any starting point. For polynomials with some complex roots, little is 
theoretically proved about the method’s convergence. Much empirical experience, 
however, suggests that nonconvergence is extremely unusual, and, further, can almost 
always be fixed by a simple scheme to break a nonconverging limit cycle. (This is 
implemented in our routine, below.) An example of a polynomial that requires this 
cycle-breaking scheme is one of high degree (> 20), with all its roots just outside of 
the complex unit circle, approximately equally spaced around it. When the method 
converges on a simple complex zero, it is known that its convergence is third order. 

In some instances the complex arithmetic in the Laguerre method is no 
disadvantage, since the polynomial itself may have complex coefficients. 

To motivate (although not rigorously derive) the Laguerre formulas we can note
the following relations between the polynomial and its roots and derivatives

$$P_n(x) = (x - x_1)(x - x_2)\cdots(x - x_n) \qquad (9.5.4)$$

$$\ln|P_n(x)| = \ln|x - x_1| + \ln|x - x_2| + \cdots + \ln|x - x_n| \qquad (9.5.5)$$

$$\frac{d\ln|P_n(x)|}{dx} = \frac{1}{x - x_1} + \frac{1}{x - x_2} + \cdots + \frac{1}{x - x_n} = \frac{P_n'}{P_n} \equiv G \qquad (9.5.6)$$

$$-\frac{d^2\ln|P_n(x)|}{dx^2} = \frac{1}{(x - x_1)^2} + \frac{1}{(x - x_2)^2} + \cdots + \frac{1}{(x - x_n)^2} = \left[\frac{P_n'}{P_n}\right]^2 - \frac{P_n''}{P_n} \equiv H \qquad (9.5.7)$$

Starting from these relations, the Laguerre formulas make what Acton [1] nicely calls
“a rather drastic set of assumptions”: The root $x_1$ that we seek is assumed to be
located some distance $a$ from our current guess $x$, while all other roots are assumed
to be located at a distance $b$

$$x - x_1 = a; \qquad x - x_i = b, \quad i = 2, 3, \ldots, n \qquad (9.5.8)$$

Then we can express (9.5.6), (9.5.7) as

$$\frac{1}{a} + \frac{n-1}{b} = G \qquad (9.5.9)$$

$$\frac{1}{a^2} + \frac{n-1}{b^2} = H \qquad (9.5.10)$$

which yields as the solution for $a$

$$a = \frac{n}{G \pm \sqrt{(n-1)(nH - G^2)}} \qquad (9.5.11)$$

where the sign should be taken to yield the largest magnitude for the denominator.
Since the factor inside the square root can be negative, $a$ can be complex. (A more
rigorous justification of equation 9.5.11 is in [4].)

The method operates iteratively: For a trial value $x$, $a$ is calculated by equation
(9.5.11). Then $x - a$ becomes the next trial value. This continues until $a$ is
sufficiently small.

The following routine implements the Laguerre method to find one root of a
given polynomial of degree m, whose coefficients can be complex. As usual, the first
coefficient a[0] is the constant term, while a[m] is the coefficient of the highest
power of x. The routine implements a simplified version of an elegant stopping
criterion due to Adams [5], which neatly balances the desire to achieve full machine
accuracy, on the one hand, with the danger of iterating forever in the presence of
roundoff error, on the other.

#include <math.h>
#include "complex.h"
#include "nrutil.h"
#define EPSS 1.0e-7
#define MR 8
#define MT 10
#define MAXIT (MT*MR)
/* Here EPSS is the estimated fractional roundoff error. We try to break (rare) limit cycles with
MR different fractional values, once every MT steps, for MAXIT total allowed iterations. */

void laguer(fcomplex a[], int m, fcomplex *x, int *its)
/* Given the degree m and the m+1 complex coefficients a[0..m] of the polynomial
sum_{i=0}^{m} a[i] x^i, and given a complex value x, this routine improves x by Laguerre's
method until it converges, within the achievable roundoff limit, to a root of the given
polynomial. The number of iterations taken is returned as its. */
{
    int iter,j;
    float abx,abp,abm,err;
    fcomplex dx,x1,b,d,f,g,h,sq,gp,gm,g2;
    static float frac[MR+1] = {0.0,0.5,0.25,0.75,0.13,0.38,0.62,0.88,1.0};
    /* Fractions used to break a limit cycle. */

    for (iter=1;iter<=MAXIT;iter++) {   /* Loop over iterations up to allowed maximum. */
        *its=iter;
        b=a[m];
        err=Cabs(b);
        d=f=Complex(0.0,0.0);
        abx=Cabs(*x);
        for (j=m-1;j>=0;j--) {          /* Efficient computation of the polynomial and */
            f=Cadd(Cmul(*x,f),d);       /* its first two derivatives. f stores P''/2. */
            d=Cadd(Cmul(*x,d),b);
            b=Cadd(Cmul(*x,b),a[j]);
            err=Cabs(b)+abx*err;
        }
        err *= EPSS;                    /* Estimate of roundoff error in evaluating polynomial. */
        if (Cabs(b) <= err) return;     /* We are on the root. */
        g=Cdiv(d,b);                    /* The generic case: use Laguerre's formula. */
        g2=Cmul(g,g);
        h=Csub(g2,RCmul(2.0,Cdiv(f,b)));
        sq=Csqrt(RCmul((float) (m-1),Csub(RCmul((float) m,h),g2)));
        gp=Cadd(g,sq);
        gm=Csub(g,sq);
        abp=Cabs(gp);
        abm=Cabs(gm);
        if (abp < abm) gp=gm;
        dx=((FMAX(abp,abm) > 0.0 ? Cdiv(Complex((float) m,0.0),gp)
            : RCmul(1+abx,Complex(cos((float)iter),sin((float)iter)))));
        x1=Csub(*x,dx);
        if (x->r == x1.r && x->i == x1.i) return;       /* Converged. */
        if (iter % MT) *x=x1;
        else *x=Csub(*x,RCmul(frac[iter/MT],dx));
        /* Every so often we take a fractional step, to break any limit cycle (itself a rare
        occurrence). */
    }
    nrerror("too many iterations in laguer");
    /* Very unusual; can occur only for complex roots. Try a different starting guess for the
    root. */
    return;
}


Here is a driver routine that calls laguer in succession for each root, performs 
the deflation, optionally polishes the roots by the same Laguerre method — if you 
are not going to polish in some other way — and finally sorts the roots by their real 
parts. (We will use this routine in Chapter 13.) 


#include <math.h>
#include "complex.h"
#define EPS 2.0e-6
#define MAXM 100
/* A small number, and maximum anticipated value of m. */

void zroots(fcomplex a[], int m, fcomplex roots[], int polish)
/* Given the degree m and the m+1 complex coefficients a[0..m] of the polynomial
sum_{i=0}^{m} a[i] x^i, this routine successively calls laguer and finds all m complex roots in
roots[1..m]. The boolean variable polish should be input as true (1) if polishing (also by
Laguerre's method) is desired, false (0) if the roots will be subsequently polished by other
means. */
{
    void laguer(fcomplex a[], int m, fcomplex *x, int *its);
    int i,its,j,jj;
    fcomplex x,b,c,ad[MAXM];

    for (j=0;j<=m;j++) ad[j]=a[j];      /* Copy of coefficients for successive deflation. */
    for (j=m;j>=1;j--) {                /* Loop over each root to be found. */
        x=Complex(0.0,0.0);             /* Start at zero to favor convergence to smallest */
        laguer(ad,j,&x,&its);           /* remaining root, and find the root. */
        if (fabs(x.i) <= 2.0*EPS*fabs(x.r)) x.i=0.0;
        roots[j]=x;
        b=ad[j];                        /* Forward deflation. */
        for (jj=j-1;jj>=0;jj--) {
            c=ad[jj];
            ad[jj]=b;
            b=Cadd(Cmul(x,b),c);
        }
    }
    if (polish)
        for (j=1;j<=m;j++)              /* Polish the roots using the undeflated coefficients. */
            laguer(a,m,&roots[j],&its);
    for (j=2;j<=m;j++) {                /* Sort roots by their real parts by straight insertion. */
        x=roots[j];
        for (i=j-1;i>=1;i--) {
            if (roots[i].r <= x.r) break;
            roots[i+1]=roots[i];
        }
        roots[i+1]=x;
    }
}
Eigenvalue Methods

The eigenvalues of a matrix $\mathbf{A}$ are the roots of the “characteristic polynomial”
$P(x) = \det[\mathbf{A} - x\mathbf{I}]$. However, as we will see in Chapter 11, root-finding is not
generally an efficient way to find eigenvalues. Turning matters around, we can
use the more efficient eigenvalue methods that are discussed in Chapter 11 to find
the roots of arbitrary polynomials. You can easily verify (see, e.g., [6]) that the
characteristic polynomial of the special $m \times m$ companion matrix

$$\mathbf{A} = \begin{pmatrix}
-\dfrac{a_{m-1}}{a_m} & -\dfrac{a_{m-2}}{a_m} & \cdots & -\dfrac{a_1}{a_m} & -\dfrac{a_0}{a_m} \\
1 & 0 & \cdots & 0 & 0 \\
0 & 1 & \cdots & 0 & 0 \\
\vdots & & & & \vdots \\
0 & 0 & \cdots & 1 & 0
\end{pmatrix} \qquad (9.5.12)$$

is equivalent to the general polynomial

$$P(x) = \sum_{i=0}^{m} a_i x^i \qquad (9.5.13)$$

If the coefficients $a_i$ are real, rather than complex, then the eigenvalues of $\mathbf{A}$ can be
found using the routines balanc and hqr in §§11.5-11.6 (see discussion there). This
method, implemented in the routine zrhqr following, is typically about a factor 2
slower than zroots (above). However, for some classes of polynomials, it is a more
robust technique, largely because of the fairly sophisticated convergence methods
embodied in hqr. If your polynomial has real coefficients, and you are having
trouble with zroots, then zrhqr is a recommended alternative.
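
The structure of the companion matrix is perhaps clearest in code. The sketch below
fills the matrix with zero-based, row-major storage, which differs from the unit-offset
float ** matrices used by the book's routines; the name companion_matrix and its return
convention are assumptions made for illustration.

```c
/* Illustrative sketch: fill the m-by-m companion matrix of
   P(x) = a[0] + a[1] x + ... + a[m] x^m into A[0..m*m-1] (row-major,
   zero-based). The first row holds -a[m-1]/a[m], ..., -a[0]/a[m]; the
   subdiagonal holds ones. Returns 0 on success, -1 on bad arguments. */
int companion_matrix(const double a[], int m, double A[])
{
    int j, k;

    if (m < 1 || a[m] == 0.0) return -1;
    for (j = 0; j < m; j++)
        for (k = 0; k < m; k++)
            A[j*m + k] = 0.0;
    for (k = 0; k < m; k++)
        A[k] = -a[m-1-k]/a[m];          /* first row */
    for (k = 0; k + 1 < m; k++)
        A[(k+1)*m + k] = 1.0;           /* subdiagonal of ones */
    return 0;
}
```

A quick check of the construction: if r is a root of P, then the vector with components
$r^{m-1}, \ldots, r, 1$ is an eigenvector of the matrix with eigenvalue r. For
$P(x) = x^2 - 3x + 2$ the matrix has rows (3, -2) and (1, 0), and the vector (2, 1) is
mapped to (4, 2).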

#include "nrutil.h"
#define MAXM 50

void zrhqr(float a[], int m, float rtr[], float rti[])
/* Find all the roots of a polynomial with real coefficients, sum_{i=0}^{m} a[i] x^i, given the
degree m and the coefficients a[0..m]. The method is to construct an upper Hessenberg matrix
whose eigenvalues are the desired roots, and then use the routines balanc and hqr. The real
and imaginary parts of the roots are returned in rtr[1..m] and rti[1..m], respectively. */
{
    void balanc(float **a, int n);
    void hqr(float **a, int n, float wr[], float wi[]);
    int j,k;
    float **hess,xr,xi;

    hess=matrix(1,MAXM,1,MAXM);
    if (m > MAXM || a[m] == 0.0) nrerror("bad args in zrhqr");
    for (k=1;k<=m;k++) {                /* Construct the matrix. */
        hess[1][k] = -a[m-k]/a[m];
        for (j=2;j<=m;j++) hess[j][k]=0.0;
        if (k != m) hess[k+1][k]=1.0;
    }
    balanc(hess,m);                     /* Find its eigenvalues. */
    hqr(hess,m,rtr,rti);
    for (j=2;j<=m;j++) {                /* Sort roots by their real parts by straight insertion. */
        xr=rtr[j];
        xi=rti[j];
        for (k=j-1;k>=1;k--) {
            if (rtr[k] <= xr) break;
            rtr[k+1]=rtr[k];
            rti[k+1]=rti[k];
        }
        rtr[k+1]=xr;
        rti[k+1]=xi;
    }
    free_matrix(hess,1,MAXM,1,MAXM);
}


Other Sure-Fire Techniques 

The Jenkins-Traub method has become practically a standard in black-box
polynomial root-finders, e.g., in the IMSL library [3]. The method is too complicated
to discuss here, but is detailed, with references to the primary literature, in [4].

The Lehmer-Schur algorithm is one of a class of methods that isolate roots in
the complex plane by generalizing the notion of one-dimensional bracketing. It is
possible to determine efficiently whether there are any polynomial roots within a
circle of given center and radius. From then on it is a matter of bookkeeping to
hunt down all the roots by a series of decisions regarding where to place new trial
circles. Consult [1] for an introduction.



Techniques for Root-Polishing 

Newton-Raphson works very well for real roots once the neighborhood of
a root has been identified. The polynomial and its derivative can be efficiently
simultaneously evaluated as in §5.3. For a polynomial of degree n with coefficients
c[0]...c[n], the following segment of code embodies one cycle of Newton-Raphson:

    p=c[n]*x+c[n-1];
    p1=c[n];
    for(i=n-2;i>=0;i--) {
        p1=p+p1*x;
        p=c[i]+p*x;
    }
    if (p1 == 0.0) nrerror("derivative should not vanish");
    x -= p/p1;

Once all real roots of a polynomial have been polished, one must polish the 
complex roots, either directly, or by looking for quadratic factors. 

Direct polishing by Newton-Raphson is straightforward for complex roots if the 
above code is converted to complex data types. With real polynomial coefficients, 
note that your starting guess (tentative root) must be off the real axis, otherwise 
you will never get off that axis — and may get shot off to infinity by a minimum 
or maximum of the polynomial. 



For real polynomials, the alternative means of polishing complex roots (or, for that 
matter, double real roots) is Bairstow's method, which seeks quadratic factors. The advantage 
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of going after quadratic factors is that it avoids all complex arithmetic. Bairstow’s method 
seeks a quadratic factor that embodies the two roots x = a ± ib, namely 

x 2 - 2ax + (a 2 + b 2 ) = x 2 + Bx + C (9.5.14) 

In general if we divide a polynomial by a quadratic factor, there will be a linear remainder 

P{x) = (x 2 + Bx + C)Q(x) + Rx + S. (9.5.15) 

Given B and C, R and S can be readily found, by polynomial division (§5.3). We can 
consider R and S to be adjustable functions of B and C, and they will be zero if the quadratic 
factor is a divisor of P(x). 

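In the neighborhood of a root, a first-order Taylor series expansion approximates the
variation of R, S with respect to small changes in B, C:

$$R(B + \delta B, C + \delta C) \approx R(B, C) + \frac{\partial R}{\partial B}\,\delta B + \frac{\partial R}{\partial C}\,\delta C \tag{9.5.16}$$

$$S(B + \delta B, C + \delta C) \approx S(B, C) + \frac{\partial S}{\partial B}\,\delta B + \frac{\partial S}{\partial C}\,\delta C \tag{9.5.17}$$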

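To evaluate the partial derivatives, consider the derivative of (9.5.15) with respect to C. Since
P(x) is a fixed polynomial, it is independent of C, hence

$$0 = (x^2 + Bx + C)\,\frac{\partial Q}{\partial C} + Q(x) + \frac{\partial R}{\partial C}\,x + \frac{\partial S}{\partial C} \tag{9.5.18}$$

which can be rewritten as

$$-Q(x) = (x^2 + Bx + C)\,\frac{\partial Q}{\partial C} + \frac{\partial R}{\partial C}\,x + \frac{\partial S}{\partial C} \tag{9.5.19}$$

Similarly, P(x) is independent of B, so differentiating (9.5.15) with respect to B gives

$$-xQ(x) = (x^2 + Bx + C)\,\frac{\partial Q}{\partial B} + \frac{\partial R}{\partial B}\,x + \frac{\partial S}{\partial B} \tag{9.5.20}$$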

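Now note that equation (9.5.19) matches equation (9.5.15) in form. Thus if we perform a
second synthetic division of P(x), i.e., a division of Q(x), yielding a remainder $R_1 x + S_1$, then

$$\frac{\partial R}{\partial C} = -R_1 \qquad \frac{\partial S}{\partial C} = -S_1 \tag{9.5.21}$$

To get the remaining partial derivatives, evaluate equation (9.5.20) at the two roots of the
quadratic, $x_+$ and $x_-$. Since

$$Q(x_\pm) = R_1 x_\pm + S_1$$

we get

$$\frac{\partial R}{\partial B}\,x_+ + \frac{\partial S}{\partial B} = -x_+(R_1 x_+ + S_1)$$

$$\frac{\partial R}{\partial B}\,x_- + \frac{\partial S}{\partial B} = -x_-(R_1 x_- + S_1)$$

Solve these two equations for the partial derivatives, using

$$x_+ + x_- = -B \qquad x_+ x_- = C$$

and find

$$\frac{\partial R}{\partial B} = BR_1 - S_1 \qquad \frac{\partial S}{\partial B} = CR_1$$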

Bairstow’s method now consists of using Newton-Raphson in two dimensions (which is 
actually the subject of the next section) to find a simultaneous zero of R and S. Synthetic 
division is used twice per cycle to evaluate R, S and their partial derivatives with respect to 
B, C. Like one-dimensional Newton-Raphson, the method works well in the vicinity of a root 
pair (real or complex), but it can fail miserably when started at a random point. We therefore 
recommend it only in the context of polishing tentative complex roots. 
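To make the bookkeeping concrete, here is a self-contained sketch of the iteration just described, written with fixed-size local arrays instead of this book's vector utilities. The names quad_divide and bairstow_polish, the limit BAIRSTOW_MAXDEG, and the convergence test are assumptions for illustration; only the formulas for the partial derivatives are taken from the derivation above.

```c
#include <assert.h>
#include <math.h>

#define BAIRSTOW_MAXDEG 32   /* hypothetical degree limit for this sketch */

/* Divide u[0..n] by x^2 + b*x + c. Quotient goes to q[0..n-2]; the linear
   remainder is (*r1)*x + (*r0). For n < 2 the quotient is empty. */
static void quad_divide(const double u[], int n, double b, double c,
                        double q[], double *r1, double *r0)
{
    double w[BAIRSTOW_MAXDEG + 1] = {0.0};
    int k;

    for (k = 0; k <= n; k++) w[k] = u[k];
    for (k = n - 2; k >= 0; k--) {
        q[k] = w[k + 2];
        w[k + 1] -= q[k] * b;
        w[k] -= q[k] * c;
    }
    *r1 = w[1];
    *r0 = w[0];
}

/* Polish a trial quadratic factor x^2 + (*b)*x + (*c) of p[0..n].
   Returns 0 on convergence, -1 on failure. */
int bairstow_polish(const double p[], int n, double *b, double *c,
                    double eps, int maxit)
{
    double q[BAIRSTOW_MAXDEG + 1], qq[BAIRSTOW_MAXDEG + 1];
    double r, s, r1, s1, rb, rc, sb, sc, det, delb, delc;
    int it;

    assert(n >= 2 && n <= BAIRSTOW_MAXDEG);
    for (it = 0; it < maxit; it++) {
        quad_divide(p, n, *b, *c, q, &r, &s);         /* remainder R x + S   */
        quad_divide(q, n - 2, *b, *c, qq, &r1, &s1);  /* remainder R1 x + S1 */
        rc = -r1;                 /* dR/dC, equation (9.5.21) */
        sc = -s1;                 /* dS/dC */
        rb = (*b) * r1 - s1;      /* dR/dB */
        sb = (*c) * r1;           /* dS/dB */
        det = rb * sc - rc * sb;
        if (det == 0.0) return -1;
        /* Newton step: rb*delb + rc*delc = -r, sb*delb + sc*delc = -s. */
        delb = (-r * sc + s * rc) / det;
        delc = (-s * rb + r * sb) / det;
        *b += delb;
        *c += delc;
        if (fabs(delb) <= eps * (1.0 + fabs(*b)) &&
            fabs(delc) <= eps * (1.0 + fabs(*c))) return 0;
    }
    return -1;
}
```

For P(x) = (x^2 + 1)(x^2 - 3x + 2), stored as p = {2, -3, 3, -3, 1}, trial values B = 0.1, C = 0.9 should polish to B = 0, C = 1.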
#include <math.h>
#include "nrutil.h"
#define ITMAX 20                /* At most ITMAX iterations. */
#define TINY 1.0e-6

void qroot(float p[], int n, float *b, float *c, float eps)
/* Given n+1 coefficients p[0..n] of a polynomial of degree n, and trial values for the coefficients
of a quadratic factor x*x+b*x+c, improve the solution until the coefficients b,c change by less
than eps. The routine poldiv (§5.3) is used. */
{
    void poldiv(float u[], int n, float v[], int nv, float q[], float r[]);
    int iter;
    float sc,sb,s,rc,rb,r,dv,delc,delb;
    float *q,*qq,*rem;
    float d[3];

    q=vector(0,n);
    qq=vector(0,n);
    rem=vector(0,n);
    d[2]=1.0;
    for (iter=1;iter<=ITMAX;iter++) {
        d[1]=(*b);
        d[0]=(*c);
        poldiv(p,n,d,2,q,rem);
        s=rem[0];                       /* First division: r,s. */
        r=rem[1];
        poldiv(q,(n-1),d,2,qq,rem);
        sb = -(*c)*(rc = -rem[1]);      /* Second division: partial r,s with respect to c. */
        rb = -(*b)*rc+(sc = -rem[0]);
        dv=1.0/(sb*rc-sc*rb);           /* Solve 2x2 equation. */
        delb=(r*sc-s*rc)*dv;
        delc=(-r*sb+s*rb)*dv;
        *b += delb;
        *c += delc;
        if ((fabs(delb) <= eps*fabs(*b) || fabs(*b) < TINY)
            && (fabs(delc) <= eps*fabs(*c) || fabs(*c) < TINY)) {
            free_vector(rem,0,n);       /* Coefficients converged. */
            free_vector(qq,0,n);
            free_vector(q,0,n);
            return;
        }
    }
    nrerror("Too many iterations in routine qroot");
}

We have already remarked on the annoyance of having two tentative roots 
collapse to one value under polishing. You are left not knowing whether your 
polishing procedure has lost a root, or whether there is actually a double root, 
which was split only by roundoff errors in your previous deflation. One solution 
is deflate-and-repolish; but deflation is what we are trying to avoid at the polishing 
stage. An alternative is Maehly's procedure. Maehly pointed out that the derivative
of the reduced polynomial

$$P_j(x) \equiv \frac{P(x)}{(x - x_1)\cdots(x - x_j)} \tag{9.5.27}$$

can be written as

$$P_j'(x) = \frac{P'(x)}{(x - x_1)\cdots(x - x_j)} - \frac{P(x)}{(x - x_1)\cdots(x - x_j)} \sum_{i=1}^{j} (x - x_i)^{-1} \tag{9.5.28}$$

Hence one step of Newton-Raphson, taking a guess $x_k$ into a new guess $x_{k+1}$,
can be written as

$$x_{k+1} = x_k - \frac{P(x_k)}{P'(x_k) - P(x_k)\sum_{i=1}^{j} (x_k - x_i)^{-1}} \tag{9.5.29}$$


This equation, if used with i ranging over the roots already polished, will prevent a 
tentative root from spuriously hopping to another one’s true root. It is an example 
of so-called zero suppression as an alternative to true deflation. 
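A short sketch of equation (9.5.29) in code may help. It assumes real roots and double precision; the names maehly_step and polish_suppressed are hypothetical, and the sketch does not guard against a guess that coincides exactly with an already-polished root (which would divide by zero).

```c
#include <assert.h>
#include <math.h>

/* One Maehly-suppressed Newton step, equation (9.5.29). c[0..n] are the
   polynomial coefficients; roots[0..nroots-1] are roots already polished. */
double maehly_step(const double c[], int n, const double roots[], int nroots,
                   double x)
{
    double p = c[n], p1 = 0.0, sum = 0.0, denom;
    int i;

    assert(n >= 1);
    for (i = n - 1; i >= 0; i--) {          /* P(x) and P'(x) */
        p1 = p + p1 * x;
        p = c[i] + p * x;
    }
    for (i = 0; i < nroots; i++) sum += 1.0 / (x - roots[i]);
    denom = p1 - p * sum;
    return denom != 0.0 ? x - p / denom : x;
}

/* Iterate the suppressed step until the correction is small. */
double polish_suppressed(const double c[], int n, const double roots[],
                         int nroots, double x, double tol, int maxit)
{
    int it;

    for (it = 0; it < maxit; it++) {
        double xnew = maehly_step(c, n, roots, nroots, x);
        if (fabs(xnew - x) <= tol * (1.0 + fabs(xnew))) return xnew;
        x = xnew;
    }
    return x;
}
```

For (x-1)(x-2)(x-3) with the root 2 already polished, a tentative root of 2.2 is pushed away from 2 and settles on 3; with no roots suppressed (nroots = 0) the same starting guess simply collapses onto 2.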

Muller’s method, which was described above, can also be useful at the polishing 
stage. 
CITED REFERENCES AND FURTHER READING:

Acton, F.S. 1970, Numerical Methods That Work; 1990, corrected edition (Washington:
Mathematical Association of America), Chapter 7. [1]

Peters G., and Wilkinson, J.H. 1971, Journal of the Institute of Mathematics and its Applications,
vol. 8, pp. 16-35. [2]

IMSL Math/Library Users Manual (IMSL Inc., 2500 CityWest Boulevard, Houston TX 77042). [3]

Ralston, A., and Rabinowitz, P. 1978, A First Course in Numerical Analysis, 2nd ed. (New York:
McGraw-Hill), §8.9-8.13. [4]

Adams, D.A. 1967, Communications of the ACM, vol. 10, pp. 655-658. [5]

Johnson, L.W., and Riess, R.D. 1982, Numerical Analysis, 2nd ed. (Reading, MA:
Addison-Wesley), §4.4.3. [6]

Henrici, P. 1974, Applied and Computational Complex Analysis, vol. 1 (New York: Wiley). 

Stoer, J., and Bulirsch, R. 1980, Introduction to Numerical Analysis (New York: Springer-Verlag), 
§§5.5-5.9. 


9.6 Newton-Raphson Method for Nonlinear 
Systems of Equations 


We make an extreme, but wholly defensible, statement: There are no good, general
methods for solving systems of more than one nonlinear equation. Furthermore,
it is not hard to see why (very likely) there never will be any good, general methods:
Consider the case of two dimensions, where we want to solve simultaneously

$$\begin{aligned} f(x,y) &= 0 \\ g(x,y) &= 0 \end{aligned} \tag{9.6.1}$$


The functions f and g are two arbitrary functions, each of which has zero
contour lines that divide the (x, y) plane into regions where their respective function
is positive or negative. These zero contour boundaries are of interest to us. The
solutions that we seek are those points (if any) that are common to the zero contours
of f and g (see Figure 9.6.1). Unfortunately, the functions f and g have, in general,
no relation to each other at all! There is nothing special about a common point from
either f's point of view, or from g's. In order to find all common points, which are
the solutions of our nonlinear equations, we will (in general) have to do neither more
nor less than map out the full zero contours of both functions. Note further that
the zero contours will (in general) consist of an unknown number of disjoint closed
curves. How can we ever hope to know when we have found all such disjoint pieces?

Figure 9.6.1. Solution of two nonlinear equations in two unknowns. Solid curves refer to f(x,y),
dashed curves to g(x,y). Each equation divides the (x,y) plane into positive and negative regions,
bounded by zero curves. The desired solutions are the intersections of these unrelated zero curves. The
number of solutions is a priori unknown.

For problems in more than two dimensions, we need to find points mutually
common to N unrelated zero-contour hypersurfaces, each of dimension N − 1. You
see that root finding becomes virtually impossible without insight! You will almost
always have to use additional information, specific to your particular problem, to
answer such basic questions as, “Do I expect a unique solution?” and “Approximately
where?” Acton [1] has a good discussion of some of the particular strategies that
can be tried.

In this section we will discuss the simplest multidimensional root finding 
method, Newton-Raphson. This method gives you a very efficient means of 
converging to a root, if you have a sufficiently good initial guess. It can also 
spectacularly fail to converge, indicating (though not proving) that your putative 
root does not exist nearby. In §9.7 we discuss more sophisticated implementations 
of the Newton-Raphson method, which try to improve on Newton-Raphson’s poor 
global convergence. A multidimensional generalization of the secant method, called 
Broyden’s method, is also discussed in §9.7. 

A typical problem gives N functional relations to be zeroed, involving variables
$x_i$, $i = 1, 2, \ldots, N$:

$$F_i(x_1, x_2, \ldots, x_N) = 0 \qquad i = 1, 2, \ldots, N. \tag{9.6.2}$$

We let $\mathbf{x}$ denote the entire vector of values $x_i$ and $\mathbf{F}$ denote the entire vector of
functions $F_i$. In the neighborhood of $\mathbf{x}$, each of the functions $F_i$ can be expanded
in Taylor series

$$F_i(\mathbf{x} + \delta\mathbf{x}) = F_i(\mathbf{x}) + \sum_{j=1}^{N} \frac{\partial F_i}{\partial x_j}\,\delta x_j + O(\delta\mathbf{x}^2). \tag{9.6.3}$$

The matrix of partial derivatives appearing in equation (9.6.3) is the Jacobian matrix $\mathbf{J}$:

$$J_{ij} \equiv \frac{\partial F_i}{\partial x_j} \tag{9.6.4}$$

In matrix notation equation (9.6.3) is

$$\mathbf{F}(\mathbf{x} + \delta\mathbf{x}) = \mathbf{F}(\mathbf{x}) + \mathbf{J} \cdot \delta\mathbf{x} + O(\delta\mathbf{x}^2). \tag{9.6.5}$$

By neglecting terms of order $\delta\mathbf{x}^2$ and higher and by setting $\mathbf{F}(\mathbf{x} + \delta\mathbf{x}) = 0$, we
obtain a set of linear equations for the corrections $\delta\mathbf{x}$ that move each function closer
to zero simultaneously, namely

$$\mathbf{J} \cdot \delta\mathbf{x} = -\mathbf{F}. \tag{9.6.6}$$

Matrix equation (9.6.6) can be solved by LU decomposition as described in
§2.3. The corrections are then added to the solution vector,

$$\mathbf{x}_{\rm new} = \mathbf{x}_{\rm old} + \delta\mathbf{x} \tag{9.6.7}$$

and the process is iterated to convergence. In general it is a good idea to check the 
degree to which both functions and variables have converged. Once either reaches 
machine accuracy, the other won’t change. 
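The iteration (9.6.6)-(9.6.7) can be sketched for the smallest nontrivial case, N = 2, where the linear system can be solved by Cramer's rule rather than LU decomposition. The test system (a circle of radius 2 intersected with the hyperbola xy = 1) and the names example_system and newton2d are assumptions chosen for illustration.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical test system: F0 = x^2 + y^2 - 4, F1 = x*y - 1,
   together with its analytic Jacobian J[i][j] = dF_i/dx_j. */
static void example_system(const double x[2], double F[2], double J[2][2])
{
    F[0] = x[0] * x[0] + x[1] * x[1] - 4.0;
    F[1] = x[0] * x[1] - 1.0;
    J[0][0] = 2.0 * x[0];  J[0][1] = 2.0 * x[1];
    J[1][0] = x[1];        J[1][1] = x[0];
}

/* Newton-Raphson in two dimensions. Returns 0 if either the summed
   |F_i| falls below tolf or the summed |dx_i| falls below tolx within
   ntrial steps, and -1 otherwise (including a singular Jacobian). */
int newton2d(double x[2], int ntrial, double tolx, double tolf)
{
    int k;

    assert(ntrial > 0);
    for (k = 0; k < ntrial; k++) {
        double F[2], J[2][2], det, dx0, dx1;

        example_system(x, F, J);
        if (fabs(F[0]) + fabs(F[1]) <= tolf) return 0;
        det = J[0][0] * J[1][1] - J[0][1] * J[1][0];
        if (det == 0.0) return -1;
        /* Solve J . dx = -F by Cramer's rule. */
        dx0 = (-F[0] * J[1][1] + F[1] * J[0][1]) / det;
        dx1 = (-F[1] * J[0][0] + F[0] * J[1][0]) / det;
        x[0] += dx0;
        x[1] += dx1;
        if (fabs(dx0) + fabs(dx1) <= tolx) return 0;
    }
    return -1;
}
```

Starting from (2, 0.5), which is close to one of the four intersections, the iteration should converge in a handful of steps. From a poor starting point it may not converge at all, which is the limitation taken up in §9.7.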

The following routine mnewt performs ntrial iterations starting from an initial
guess at the solution vector x[1..n]. Iteration stops if either the sum of the
magnitudes of the functions is less than some tolerance tolf, or the sum of the
absolute values of the corrections to $\delta x_i$ is less than some tolerance tolx. mnewt
calls a user-supplied function usrfun which must provide the function values F and
the Jacobian matrix J. If J is difficult to compute analytically, you can try having
usrfun call the routine fdjac of §9.7 to compute the partial derivatives by finite
differences. You should not make ntrial too big; rather inspect to see what is
happening before continuing for some further iterations.

#include <math.h>
#include "nrutil.h"

void usrfun(float *x,int n,float *fvec,float **fjac);
#define FREERETURN {free_matrix(fjac,1,n,1,n);free_vector(fvec,1,n);\
    free_vector(p,1,n);free_ivector(indx,1,n);return;}

void mnewt(int ntrial, float x[], int n, float tolx, float tolf)
/* Given an initial guess x[1..n] for a root in n dimensions, take ntrial Newton-Raphson steps
to improve the root. Stop if the root converges in either summed absolute variable increments
tolx or summed absolute function values tolf. */
{
    void lubksb(float **a, int n, int *indx, float b[]);
    void ludcmp(float **a, int n, int *indx, float *d);
    int k,i,*indx;
    float errx,errf,d,*fvec,**fjac,*p;

    indx=ivector(1,n);
    p=vector(1,n);
    fvec=vector(1,n);
    fjac=matrix(1,n,1,n);
    for (k=1;k<=ntrial;k++) {
        usrfun(x,n,fvec,fjac);          /* User function supplies function values at x in
                                           fvec and Jacobian matrix in fjac. */
        errf=0.0;
        for (i=1;i<=n;i++) errf += fabs(fvec[i]);   /* Check function convergence. */
        if (errf <= tolf) FREERETURN
        for (i=1;i<=n;i++) p[i] = -fvec[i];         /* Right-hand side of linear equations. */
        ludcmp(fjac,n,indx,&d);         /* Solve linear equations using LU decomposition. */
        lubksb(fjac,n,indx,p);
        errx=0.0;                       /* Check root convergence. */
        for (i=1;i<=n;i++) {            /* Update solution. */
            errx += fabs(p[i]);
            x[i] += p[i];
        }
        if (errx <= tolx) FREERETURN
    }
    FREERETURN
}


Newton’s Method versus Minimization 

In the next chapter, we will find that there are efficient general techniques for 
finding a minimum of a function of many variables. Why is that task (relatively) 
easy, while multidimensional root finding is often quite hard? Isn’t minimization 
equivalent to finding a zero of an N-dimensional gradient vector, not so different from
zeroing an N-dimensional function? No! The components of a gradient vector are not
independent, arbitrary functions. Rather, they obey so-called integrability conditions 
that are highly restrictive. Put crudely, you can always find a minimum by sliding 
downhill on a single surface. The test of “downhillness” is thus one-dimensional. 
There is no analogous conceptual procedure for finding a multidimensional root, 
where “downhill” must mean simultaneously downhill in N separate function spaces, 
thus allowing a multitude of trade-offs, as to how much progress in one dimension 
is worth compared with progress in another. 

It might occur to you to carry out multidimensional root finding by collapsing 
all these dimensions into one: Add up the sums of squares of the individual functions 
Fi to get a master function F which (i) is positive definite, and (ii) has a global 
minimum of zero exactly at all solutions of the original set of nonlinear equations. 
Unfortunately, as you will see in the next chapter, the efficient algorithms for finding 
minima come to rest on global and local minima indiscriminately. You will often 
find, to your great dissatisfaction, that your function F has a great number of local 
minima. In Figure 9.6.1, for example, there is likely to be a local minimum wherever 
the zero contours of f and g make a close approach to each other. The point labeled
M is such a point, and one sees that there are no nearby roots. 

However, we will now see that sophisticated strategies for multidimensional 
root finding can in fact make use of the idea of minimizing a master function F, by 
combining it with Newton’s method applied to the full set of functions Fi. While 
such methods can still occasionally fail by coming to rest on a local minimum of 
F, they often succeed where a direct attack via Newton’s method alone fails. The
next section deals with these methods. 


CITED REFERENCES AND FURTHER READING: 

Acton, F.S. 1970, Numerical Methods That Work; 1990, corrected edition (Washington:
Mathematical Association of America), Chapter 14. [1]

Ostrowski, A.M. 1966, Solutions of Equations and Systems of Equations, 2nd ed. (New York:
Academic Press).

Ortega, J., and Rheinboldt, W. 1970, Iterative Solution of Nonlinear Equations in Several
Variables (New York: Academic Press).


9.7 Globally Convergent Methods for Nonlinear 
Systems of Equations 

We have seen that Newton’s method for solving nonlinear equations has an 
unfortunate tendency to wander off into the wild blue yonder if the initial guess is 
not sufficiently close to the root. A global method is one that converges to a solution 
from almost any starting point. In this section we will develop an algorithm that 
combines the rapid local convergence of Newton’s method with a globally convergent 
strategy that will guarantee some progress towards the solution at each iteration. 
The algorithm is closely related to the quasi-Newton method of minimization which 
we will describe in §10.7. 

Recall our discussion of §9.6: the Newton step for the set of equations

$$\mathbf{F}(\mathbf{x}) = 0 \tag{9.7.1}$$

is

$$\mathbf{x}_{\rm new} = \mathbf{x}_{\rm old} + \delta\mathbf{x} \tag{9.7.2}$$

where

$$\delta\mathbf{x} = -\mathbf{J}^{-1} \cdot \mathbf{F} \tag{9.7.3}$$

Here $\mathbf{J}$ is the Jacobian matrix. How do we decide whether to accept the Newton step
$\delta\mathbf{x}$? A reasonable strategy is to require that the step decrease $|\mathbf{F}|^2 = \mathbf{F} \cdot \mathbf{F}$. This is
the same requirement we would impose if we were trying to minimize

$$f = \tfrac{1}{2}\,\mathbf{F} \cdot \mathbf{F} \tag{9.7.4}$$

(The $\tfrac{1}{2}$ is for later convenience.) Every solution to (9.7.1) minimizes (9.7.4), but
there may be local minima of (9.7.4) that are not solutions to (9.7.1). Thus, as
already mentioned, simply applying one of our minimum finding algorithms from
Chapter 10 to (9.7.4) is not a good idea.

To develop a better strategy, note that the Newton step (9.7.3) is a descent
direction for f:

$$\nabla f \cdot \delta\mathbf{x} = (\mathbf{F} \cdot \mathbf{J}) \cdot (-\mathbf{J}^{-1} \cdot \mathbf{F}) = -\mathbf{F} \cdot \mathbf{F} < 0 \tag{9.7.5}$$
Thus our strategy is quite simple: We always first try the full Newton step,
because once we are close enough to the solution we will get quadratic convergence.
However, we check at each iteration that the proposed step reduces f. If not, we
backtrack along the Newton direction until we have an acceptable step. Because the
Newton step is a descent direction for f, we are guaranteed to find an acceptable step
by backtracking. We will discuss the backtracking algorithm in more detail below.

Note that this method essentially minimizes f by taking Newton steps designed
to bring F to zero. This is not equivalent to minimizing f directly by taking Newton
steps designed to bring $\nabla f$ to zero. While the method can still occasionally fail by
landing on a local minimum of f, this is quite rare in practice. The routine newt
below will warn you if this happens. The remedy is to try a new starting point.

Line Searches and Backtracking 

When we are not close enough to the minimum of f, taking the full Newton step $\mathbf{p} = \delta\mathbf{x}$
need not decrease the function; we may move too far for the quadratic approximation to be
valid. All we are guaranteed is that initially f decreases as we move in the Newton direction.
So the goal is to move to a new point $\mathbf{x}_{\rm new}$ along the direction of the Newton step $\mathbf{p}$, but
not necessarily all the way:

$$\mathbf{x}_{\rm new} = \mathbf{x}_{\rm old} + \lambda\mathbf{p}, \qquad 0 < \lambda \le 1 \tag{9.7.6}$$

The aim is to find $\lambda$ so that $f(\mathbf{x}_{\rm old} + \lambda\mathbf{p})$ has decreased sufficiently. Until the early 1970s,
standard practice was to choose $\lambda$ so that $\mathbf{x}_{\rm new}$ exactly minimizes f in the direction $\mathbf{p}$. However,
we now know that it is extremely wasteful of function evaluations to do so. A better strategy
is as follows: Since $\mathbf{p}$ is always the Newton direction in our algorithms, we first try $\lambda = 1$, the
full Newton step. This will lead to quadratic convergence when $\mathbf{x}$ is sufficiently close to the
solution. However, if $f(\mathbf{x}_{\rm new})$ does not meet our acceptance criteria, we backtrack along the
Newton direction, trying a smaller value of $\lambda$, until we find a suitable point. Since the Newton
direction is a descent direction, we are guaranteed to decrease f for sufficiently small $\lambda$.

What should the criterion for accepting a step be? It is not sufficient to require merely
that $f(\mathbf{x}_{\rm new}) < f(\mathbf{x}_{\rm old})$. This criterion can fail to converge to a minimum of f in one of
two ways. First, it is possible to construct a sequence of steps satisfying this criterion with
f decreasing too slowly relative to the step lengths. Second, one can have a sequence where
the step lengths are too small relative to the initial rate of decrease of f. (For examples of
such sequences, see [1], p. 117.)

A simple way to fix the first problem is to require the average rate of decrease of f to
be at least some fraction $\alpha$ of the initial rate of decrease $\nabla f \cdot \mathbf{p}$:

$$f(\mathbf{x}_{\rm new}) \le f(\mathbf{x}_{\rm old}) + \alpha\,\nabla f \cdot (\mathbf{x}_{\rm new} - \mathbf{x}_{\rm old}) \tag{9.7.7}$$

Here the parameter $\alpha$ satisfies $0 < \alpha < 1$. We can get away with quite small values of
$\alpha$; $\alpha = 10^{-4}$ is a good choice.

The second problem can be fixed by requiring the rate of decrease of f at $\mathbf{x}_{\rm new}$ to be
greater than some fraction $\beta$ of the rate of decrease of f at $\mathbf{x}_{\rm old}$. In practice, we will not
need to impose this second constraint because our backtracking algorithm will have a built-in
cutoff to avoid taking steps that are too small.

Here is the strategy for a practical backtracking routine: Define

$$g(\lambda) \equiv f(\mathbf{x}_{\rm old} + \lambda\mathbf{p}) \tag{9.7.8}$$

so that

$$g'(\lambda) = \nabla f \cdot \mathbf{p} \tag{9.7.9}$$

If we need to backtrack, then we model g with the most current information we have and
choose $\lambda$ to minimize the model. We start with $g(0)$ and $g'(0)$ available. The first step is
always the Newton step, $\lambda = 1$. If this step is not acceptable, we have available $g(1)$ as well.
We can therefore model $g(\lambda)$ as a quadratic:

$$g(\lambda) \approx [g(1) - g(0) - g'(0)]\lambda^2 + g'(0)\lambda + g(0) \tag{9.7.10}$$

Taking the derivative of this quadratic, we find that it is a minimum when

$$\lambda = -\frac{g'(0)}{2[g(1) - g(0) - g'(0)]} \tag{9.7.11}$$

Since the Newton step failed, we can show that $\lambda \lesssim \tfrac{1}{2}$ for small $\alpha$. We need to guard against
too small a value of $\lambda$, however. We set $\lambda_{\min} = 0.1$.

On second and subsequent backtracks, we model g as a cubic in $\lambda$, using the previous
value $g(\lambda_1)$ and the second most recent value $g(\lambda_2)$:

$$g(\lambda) = a\lambda^3 + b\lambda^2 + g'(0)\lambda + g(0) \tag{9.7.12}$$

Requiring this expression to give the correct values of g at $\lambda_1$ and $\lambda_2$ gives two equations
that can be solved for the coefficients a and b:

$$\begin{bmatrix} a \\ b \end{bmatrix} = \frac{1}{\lambda_1 - \lambda_2} \begin{bmatrix} 1/\lambda_1^2 & -1/\lambda_2^2 \\ -\lambda_2/\lambda_1^2 & \lambda_1/\lambda_2^2 \end{bmatrix} \cdot \begin{bmatrix} g(\lambda_1) - g'(0)\lambda_1 - g(0) \\ g(\lambda_2) - g'(0)\lambda_2 - g(0) \end{bmatrix} \tag{9.7.13}$$

The minimum of the cubic (9.7.12) is at

$$\lambda = \frac{-b + \sqrt{b^2 - 3a\,g'(0)}}{3a} \tag{9.7.14}$$

We enforce that $\lambda$ lie between $\lambda_{\max} = 0.5\lambda_1$ and $\lambda_{\min} = 0.1\lambda_1$.

The routine has two additional features, a minimum step length alamin and a maximum 
step length stpmax. lnsrch will also be used in the quasi-Newton minimization routine 
dfpmin in the next section. 


#include <math.h>
#include "nrutil.h"
#define ALF 1.0e-4              /* Ensures sufficient decrease in function value. */
#define TOLX 1.0e-7             /* Convergence criterion on delta x. */

void lnsrch(int n, float xold[], float fold, float g[], float p[], float x[],
    float *f, float stpmax, int *check, float (*func)(float []))
/* Given an n-dimensional point xold[1..n], the value of the function and gradient there, fold
and g[1..n], and a direction p[1..n], finds a new point x[1..n] along the direction p from
xold where the function func has decreased "sufficiently." The new function value is returned
in f. stpmax is an input quantity that limits the length of the steps so that you do not try to
evaluate the function in regions where it is undefined or subject to overflow. p is usually the
Newton direction. The output quantity check is false (0) on a normal exit. It is true (1) when
x is too close to xold. In a minimization algorithm, this usually signals convergence and can
be ignored. However, in a zero-finding algorithm the calling program should check whether the
convergence is spurious. Some "difficult" problems may require double precision in this routine. */
{
    int i;
    float a,alam,alam2,alamin,b,disc,f2,rhs1,rhs2,slope,sum,temp,
        test,tmplam;

    *check=0;
    for (sum=0.0,i=1;i<=n;i++) sum += p[i]*p[i];
    sum=sqrt(sum);
    if (sum > stpmax)
        for (i=1;i<=n;i++) p[i] *= stpmax/sum;  /* Scale if attempted step is too big. */
    for (slope=0.0,i=1;i<=n;i++)
        slope += g[i]*p[i];
    if (slope >= 0.0) nrerror("Roundoff problem in lnsrch.");
    test=0.0;                                   /* Compute lambda_min. */
    for (i=1;i<=n;i++) {
        temp=fabs(p[i])/FMAX(fabs(xold[i]),1.0);
        if (temp > test) test=temp;
    }
    alamin=TOLX/test;
    alam=1.0;                                   /* Always try full Newton step first. */
    for (;;) {                                  /* Start of iteration loop. */
        for (i=1;i<=n;i++) x[i]=xold[i]+alam*p[i];
        *f=(*func)(x);
        if (alam < alamin) {                    /* Convergence on delta x. For zero finding,
                                                   the calling program should verify the
                                                   convergence. */
            for (i=1;i<=n;i++) x[i]=xold[i];
            *check=1;
            return;
        } else if (*f <= fold+ALF*alam*slope) return;   /* Sufficient function decrease. */
        else {                                  /* Backtrack. */
            if (alam == 1.0)
                tmplam = -slope/(2.0*(*f-fold-slope));  /* First time. */
            else {                              /* Subsequent backtracks. */
                rhs1 = *f-fold-alam*slope;
                rhs2=f2-fold-alam2*slope;
                a=(rhs1/(alam*alam)-rhs2/(alam2*alam2))/(alam-alam2);
                b=(-alam2*rhs1/(alam*alam)+alam*rhs2/(alam2*alam2))/(alam-alam2);
                if (a == 0.0) tmplam = -slope/(2.0*b);
                else {
                    disc=b*b-3.0*a*slope;
                    if (disc < 0.0) tmplam=0.5*alam;
                    else if (b <= 0.0) tmplam=(-b+sqrt(disc))/(3.0*a);
                    else tmplam=-slope/(b+sqrt(disc));
                }
                if (tmplam > 0.5*alam)
                    tmplam=0.5*alam;            /* lambda <= 0.5 lambda_1. */
            }
        }
        alam2=alam;
        f2 = *f;
        alam=FMAX(tmplam,0.1*alam);             /* lambda >= 0.1 lambda_1. */
    }                                           /* Try again. */
}


Here now is the globally convergent Newton routine newt that uses lnsrch. A feature
of newt is that you need not supply the Jacobian matrix analytically; the routine will attempt to
compute the necessary partial derivatives of F by finite differences in the routine fdjac. This
routine uses some of the techniques described in §5.7 for computing numerical derivatives. Of
course, you can always replace fdjac with a routine that calculates the Jacobian analytically
if this is easy for you to do.
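The finite-difference idea can be tried in isolation with a small, self-contained sketch. It uses zero-based double-precision arrays rather than this book's unit-offset float arrays; the names fd_jacobian, example_vecfunc, vecfunc_t, and the limit FD_NMAX are hypothetical, and the step size 1.5e-8 is an assumed value of roughly the square root of double-precision machine epsilon.

```c
#include <assert.h>
#include <math.h>

#define FD_NMAX 8   /* hypothetical size limit for this sketch */

typedef void (*vecfunc_t)(int n, const double x[], double f[]);

/* Hypothetical test system: F0 = x0^2 + x1, F1 = x0*x1. */
void example_vecfunc(int n, const double x[], double f[])
{
    (void)n;
    f[0] = x[0] * x[0] + x[1];
    f[1] = x[0] * x[1];
}

/* Forward-difference approximation df[i][j] ~ dF_i/dx_j at x[0..n-1].
   fvec[0..n-1] must already hold F(x). x is perturbed and then restored. */
void fd_jacobian(int n, double x[], const double fvec[],
                 double df[][FD_NMAX], vecfunc_t func)
{
    const double eps = 1.5e-8;
    double f[FD_NMAX];
    int i, j;

    assert(n >= 1 && n <= FD_NMAX);
    for (j = 0; j < n; j++) {
        double temp = x[j];
        double h = eps * fabs(temp);
        if (h == 0.0) h = eps;
        x[j] = temp + h;
        h = x[j] - temp;      /* use the step that is actually representable */
        func(n, x, f);
        x[j] = temp;
        for (i = 0; i < n; i++) df[i][j] = (f[i] - fvec[i]) / h;
    }
}
```

At x = (1, 2) the analytic Jacobian of the test system is [[2, 1], [2, 1]], and the forward-difference estimate should agree to about seven digits. Each Jacobian costs n extra function evaluations, which is why the choice matters when the functions are expensive.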

#include <math.h>
#include "nrutil.h"
#define MAXITS 200
#define TOLF 1.0e-4
#define TOLMIN 1.0e-6
#define TOLX 1.0e-7
#define STPMX 100.0
/* Here MAXITS is the maximum number of iterations; TOLF sets the convergence criterion on
function values; TOLMIN sets the criterion for deciding whether spurious convergence to a
minimum of fmin has occurred; TOLX is the convergence criterion on delta x; STPMX is the scaled
maximum step length allowed in line searches. */

int nn;                         /* Global variables to communicate with fmin. */
float *fvec;
void (*nrfuncv)(int n, float v[], float f[]);
#define FREERETURN {free_vector(fvec,1,n);free_vector(xold,1,n);\
    free_vector(p,1,n);free_vector(g,1,n);free_matrix(fjac,1,n,1,n);\
    free_ivector(indx,1,n);return;}

void newt(float x[], int n, int *check,
    void (*vecfunc)(int, float [], float []))
/* Given an initial guess x[1..n] for a root in n dimensions, find the root by a globally convergent
Newton's method. The vector of functions to be zeroed, called fvec[1..n] in the routine
below, is returned by the user-supplied routine vecfunc(n,x,fvec). The output quantity
check is false (0) on a normal return and true (1) if the routine has converged to a local
minimum of the function fmin defined below. In this case try restarting from a different initial
guess. */
{
    void fdjac(int n, float x[], float fvec[], float **df,
        void (*vecfunc)(int, float [], float []));
    float fmin(float x[]);
    void lnsrch(int n, float xold[], float fold, float g[], float p[], float x[],
        float *f, float stpmax, int *check, float (*func)(float []));
    void lubksb(float **a, int n, int *indx, float b[]);
    void ludcmp(float **a, int n, int *indx, float *d);
    int i,its,j,*indx;
    float d,den,f,fold,stpmax,sum,temp,test,**fjac,*g,*p,*xold;

    indx=ivector(1,n);
    fjac=matrix(1,n,1,n);
    g=vector(1,n);
    p=vector(1,n);
    xold=vector(1,n);
    fvec=vector(1,n);           /* Define global variables. */
    nn=n;
    nrfuncv=vecfunc;
    f=fmin(x);                  /* fvec is also computed by this call. */
    test=0.0;                   /* Test for initial guess being a root. Use
                                   more stringent test than simply TOLF. */
    for (i=1;i<=n;i++)
        if (fabs(fvec[i]) > test) test=fabs(fvec[i]);
    if (test < 0.01*TOLF) {
        *check=0;
        FREERETURN
    }
    for (sum=0.0,i=1;i<=n;i++) sum += SQR(x[i]);    /* Calculate stpmax for line searches. */
    stpmax=STPMX*FMAX(sqrt(sum),(float)n);
    for (its=1;its<=MAXITS;its++) {                 /* Start of iteration loop. */
        fdjac(n,x,fvec,fjac,vecfunc);
        /* If analytic Jacobian is available, you can replace the routine fdjac below with your
           own routine. */
        for (i=1;i<=n;i++) {                        /* Compute grad f for the line search. */
            for (sum=0.0,j=1;j<=n;j++) sum += fjac[j][i]*fvec[j];
            g[i]=sum;
        }
        for (i=1;i<=n;i++) xold[i]=x[i];            /* Store x, */
        fold=f;                                     /* and f. */
        for (i=1;i<=n;i++) p[i] = -fvec[i];         /* Right-hand side for linear equations. */
        ludcmp(fjac,n,indx,&d);                     /* Solve linear equations by LU decomposition. */
        lubksb(fjac,n,indx,p);
        lnsrch(n,xold,fold,g,p,x,&f,stpmax,check,fmin);
        /* lnsrch returns new x and f. It also calculates fvec at the new x when it calls fmin. */
        test=0.0;                                   /* Test for convergence on function values. */
        for (i=1;i<=n;i++)
            if (fabs(fvec[i]) > test) test=fabs(fvec[i]);
        if (test < TOLF) {
            *check=0;
            FREERETURN
        }
        if (*check) {           /* Check for gradient of f zero, i.e., spurious convergence. */
            test=0.0;
            den=FMAX(f,0.5*n);
            for (i=1;i<=n;i++) {
                temp=fabs(g[i])*FMAX(fabs(x[i]),1.0)/den;
                if (temp > test) test=temp;
            }
            *check=(test < TOLMIN ? 1 : 0);
            FREERETURN
        }
        test=0.0;                                   /* Test for convergence on delta x. */
        for (i=1;i<=n;i++) {
            temp=(fabs(x[i]-xold[i]))/FMAX(fabs(x[i]),1.0);
            if (temp > test) test=temp;
        }
        if (test < TOLX) FREERETURN
    }
    nrerror("MAXITS exceeded in newt");
}


#include <math.h>
#include "nrutil.h"
#define EPS 1.0e-4              /* Approximate square root of the machine precision. */

void fdjac(int n, float x[], float fvec[], float **df,
    void (*vecfunc)(int, float [], float []))
/* Computes forward-difference approximation to Jacobian. On input, x[1..n] is the point at
which the Jacobian is to be evaluated, fvec[1..n] is the vector of function values at the
point, and vecfunc(n,x,f) is a user-supplied routine that returns the vector of functions at
x. On output, df[1..n][1..n] is the Jacobian array. */
{
    int i,j;
    float h,temp,*f;

    f=vector(1,n);
    for (j=1;j<=n;j++) {
        temp=x[j];
        h=EPS*fabs(temp);
        if (h == 0.0) h=EPS;
        x[j]=temp+h;            /* Trick to reduce finite precision error. */
        h=x[j]-temp;
        (*vecfunc)(n,x,f);
        x[j]=temp;
        for (i=1;i<=n;i++) df[i][j]=(f[i]-fvec[i])/h;   /* Forward difference formula. */
    }
    free_vector(f,1,n);
}


#include "nrutil.h"

extern int nn;
extern float *fvec;
extern void (*nrfuncv)(int n, float v[], float f[]);

float fmin(float x[])
/* Returns f = (1/2) F.F at x. The global pointer *nrfuncv points to a routine that returns the
vector of functions at x. It is set to point to a user-supplied routine in the calling program.
Global variables also communicate the function values back to the calling program. */
{
    int i;
    float sum;

    (*nrfuncv)(nn,x,fvec);
    for (sum=0.0,i=1;i<=nn;i++) sum += SQR(fvec[i]);
    return 0.5*sum;
}

The routine newt assumes that typical values of all components of x and of F are of order
unity, and it can fail if this assumption is badly violated. You should rescale the variables by
their typical values before invoking newt if this problem occurs.

Multidimensional Secant Methods: Broyden’s Method 

Newton’s method as implemented above is quite powerful, but it still has several 
disadvantages. One drawback is that the Jacobian matrix is needed. In many problems 
analytic derivatives are unavailable. If function evaluation is expensive, then the cost of 
finite-difference determination of the Jacobian can be prohibitive. 

Just as the quasi-Newton methods to be discussed in §10.7 provide cheap approximations
for the Hessian matrix in minimization algorithms, there are quasi-Newton methods that
provide cheap approximations to the Jacobian for zero finding. These methods are often called
secant methods, since they reduce to the secant method (§9.2) in one dimension (see, e.g., [1]).
The best of these methods still seems to be the first one introduced, Broyden’s method [2].

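Let us denote the approximate Jacobian by $\mathbf{B}$. Then the ith quasi-Newton step $\delta\mathbf{x}_i$
is the solution of

$$\mathbf{B}_i \cdot \delta\mathbf{x}_i = -\mathbf{F}_i \tag{9.7.15}$$

where $\delta\mathbf{x}_i = \mathbf{x}_{i+1} - \mathbf{x}_i$ (cf. equation 9.7.3). The quasi-Newton or secant condition is that
$\mathbf{B}_{i+1}$ satisfy

$$\mathbf{B}_{i+1} \cdot \delta\mathbf{x}_i = \delta\mathbf{F}_i \tag{9.7.16}$$

where $\delta\mathbf{F}_i = \mathbf{F}_{i+1} - \mathbf{F}_i$. This is the generalization of the one-dimensional secant
approximation to the derivative, $\delta F/\delta x$. However, equation (9.7.16) does not determine $\mathbf{B}_{i+1}$ uniquely
in more than one dimension.

Many different auxiliary conditions to pin down $\mathbf{B}_{i+1}$ have been explored, but the
best-performing algorithm in practice results from Broyden’s formula. This formula is based
on the idea of getting $\mathbf{B}_{i+1}$ by making the least change to $\mathbf{B}_i$ consistent with the secant
equation (9.7.16). Broyden showed that the resulting formula is

$$\mathbf{B}_{i+1} = \mathbf{B}_i + \frac{(\delta\mathbf{F}_i - \mathbf{B}_i \cdot \delta\mathbf{x}_i) \otimes \delta\mathbf{x}_i}{\delta\mathbf{x}_i \cdot \delta\mathbf{x}_i} \tag{9.7.17}$$

You can easily check that $\mathbf{B}_{i+1}$ satisfies (9.7.16).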
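That check is easy to carry out numerically. The sketch below applies Broyden's rank-one update to a small dense matrix; the fixed dimension BR_N and the name broyden_update are assumptions for illustration, and the update is applied to B directly rather than to a QR factorization.

```c
#include <assert.h>

#define BR_N 2   /* hypothetical fixed dimension for this sketch */

/* Broyden's update: B <- B + ((dF - B.dx) outer dx) / (dx.dx). */
void broyden_update(double B[BR_N][BR_N], const double dx[BR_N],
                    const double dF[BR_N])
{
    double w[BR_N], denom = 0.0;
    int i, j;

    for (j = 0; j < BR_N; j++) denom += dx[j] * dx[j];
    assert(denom > 0.0);                    /* the step must be nonzero */
    for (i = 0; i < BR_N; i++) {            /* w = dF - B.dx */
        w[i] = dF[i];
        for (j = 0; j < BR_N; j++) w[i] -= B[i][j] * dx[j];
    }
    for (i = 0; i < BR_N; i++)
        for (j = 0; j < BR_N; j++) B[i][j] += w[i] * dx[j] / denom;
}
```

Starting from the identity matrix with δx = (1, 2) and δF = (3, −1), the updated matrix is [[1.4, 0.8], [−0.6, −0.2]], and multiplying it by δx returns (3, −1), so the secant condition holds. Any vector orthogonal to δx is mapped exactly as before the update, which is the sense in which the change to B is the least possible.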

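Early implementations of Broyden’s method used the Sherman-Morrison formula,
equation (2.7.2), to invert equation (9.7.17) analytically,

$\mathbf{B}_{i+1}^{-1} = \mathbf{B}_i^{-1} + \frac{(\delta\mathbf{x}_i - \mathbf{B}_i^{-1} \cdot \delta\mathbf{F}_i) \otimes \delta\mathbf{x}_i \cdot \mathbf{B}_i^{-1}}{\delta\mathbf{x}_i \cdot \mathbf{B}_i^{-1} \cdot \delta\mathbf{F}_i}$ (9.7.18)

Then instead of solving equation (9.7.3) by, e.g., LU decomposition, one determined

$\delta\mathbf{x}_i = -\mathbf{B}_i^{-1} \cdot \mathbf{F}_i$ (9.7.19)

by matrix multiplication in $O(N^2)$ operations. The disadvantage of this method is that
it cannot easily be embedded in a globally convergent strategy, for which the gradient of
equation (9.7.4) requires $\mathbf{B}$, not $\mathbf{B}^{-1}$,

$\nabla(\tfrac{1}{2}\mathbf{F} \cdot \mathbf{F}) \simeq \mathbf{B}^T \cdot \mathbf{F}$ (9.7.20)

Accordingly, we implement the update formula in the form (9.7.17).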
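For comparison, the inverse update (9.7.18) used by the early implementations can be sketched as follows. This is an illustration only, under the same assumptions as before (row-major, zero-based doubles); the function name and the caller-supplied scratch vectors u and v are hypothetical.

```c
#include <stddef.h>

/* Sherman-Morrison form of the Broyden update, equation (9.7.18), applied
   in place to H, an approximation to the inverse Jacobian.
   u and v are caller-supplied scratch vectors of length n.
   Returns 0 on success, -1 if the denominator dx . H . dF vanishes. */
int broyden_inverse_update(double *H, size_t n, const double *dx,
                           const double *dF, double *u, double *v)
{
    double denom = 0.0;
    for (size_t i = 0; i < n; i++) {
        double HdF = 0.0;                 /* i-th component of H . dF */
        for (size_t j = 0; j < n; j++) HdF += H[i * n + j] * dF[j];
        u[i] = dx[i] - HdF;               /* u = dx - H . dF */
        denom += dx[i] * HdF;             /* accumulates dx . H . dF */
    }
    if (denom == 0.0) return -1;

    for (size_t j = 0; j < n; j++) {      /* v = dx . H (row vector) */
        double s = 0.0;
        for (size_t i = 0; i < n; i++) s += dx[i] * H[i * n + j];
        v[j] = s;
    }
    for (size_t i = 0; i < n; i++)        /* H += (u outer v) / denom */
        for (size_t j = 0; j < n; j++)
            H[i * n + j] += u[i] * v[j] / denom;
    return 0;
}
```

After a successful call, H applied to dF reproduces dx, which is the secant condition (9.7.16) written for the inverse. Each update costs $O(N^2)$ operations, as does the subsequent product in (9.7.19).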

However, we can still preserve the $O(N^2)$ solution of (9.7.3) by using QR decomposition
(§2.10) instead of LU decomposition. The reason is that because of the special form of equation
(9.7.17), the QR decomposition of $\mathbf{B}_i$ can be updated into the QR decomposition of $\mathbf{B}_{i+1}$ in
$O(N^2)$ operations (§2.10). All we need is an initial approximation $\mathbf{B}_0$ to start the ball rolling.
It is often acceptable to start simply with the identity matrix, and then allow $O(N)$ updates to
produce a reasonable approximation to the Jacobian. We prefer to spend the first $N$ function
evaluations on a finite-difference approximation to initialize $\mathbf{B}$ via a call to fdjac.
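As an illustration of what such an initialization involves, here is a minimal forward-difference sketch. It is not the book's fdjac: the calling convention, the double-precision arithmetic, the step size of sqrt(DBL_EPSILON) times |x_j|, and the example system are all assumptions made for this sketch.

```c
#include <math.h>
#include <float.h>
#include <stddef.h>

typedef void (*vecfunc_t)(size_t n, const double *x, double *f);

/* Forward-difference Jacobian: J[i*n+j] approximates dF_i/dx_j at x.
   fx must hold F(x) on entry; ftmp is scratch of length n.
   x is perturbed one component at a time and restored.
   Cost: n extra evaluations of the vector function. */
void fd_jacobian(vecfunc_t func, size_t n, double *x,
                 const double *fx, double *ftmp, double *J)
{
    const double rel = sqrt(DBL_EPSILON);
    for (size_t j = 0; j < n; j++) {
        double save = x[j];
        double h = rel * fabs(save);
        if (h == 0.0) h = rel;
        x[j] = save + h;
        h = x[j] - save;                  /* use the step actually taken */
        func(n, x, ftmp);
        x[j] = save;
        for (size_t i = 0; i < n; i++)
            J[i * n + j] = (ftmp[i] - fx[i]) / h;
    }
}

/* Hypothetical example system: F = (x^2 + y^2 - 2, x - y), root at (1, 1). */
void example_system(size_t n, const double *x, double *f)
{
    (void)n;
    f[0] = x[0] * x[0] + x[1] * x[1] - 2.0;
    f[1] = x[0] - x[1];
}
```

At the point (1, 1) the exact Jacobian of the example system has rows (2, 2) and (1, -1), and the sketch should reproduce those entries to roughly the size of the step.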

Since $\mathbf{B}$ is not the exact Jacobian, we are not guaranteed that $\delta\mathbf{x}$ is a descent direction for
$f = \tfrac{1}{2}\mathbf{F} \cdot \mathbf{F}$ (cf. equation 9.7.5). Thus the line search algorithm can fail to return a suitable step
if $\mathbf{B}$ wanders far from the true Jacobian. In this case, we reinitialize $\mathbf{B}$ by another call to fdjac.

Like the secant method in one dimension, Broyden’s method converges superlinearly 
once you get close enough to the root. Embedded in a global strategy, it is almost as robust 
as Newton’s method, and often needs far fewer function evaluations to determine a zero. 
Note that the final value of B is not always close to the true Jacobian at the root, even 
when the method converges. 

The routine broydn given below is very similar to newt in organization. The principal 
differences are the use of QR decomposition instead of LU, and the updating formula instead 
of directly determining the Jacobian. The remarks at the end of newt about scaling the 
variables apply equally to broydn. 


#include <math.h>
#include "nrutil.h"
#define MAXITS 200
#define EPS 1.0e-7
#define TOLF 1.0e-4
#define TOLX EPS
#define STPMX 100.0
#define TOLMIN 1.0e-6

Here MAXITS is the maximum number of iterations; EPS is a number close to the machine
precision; TOLF is the convergence criterion on function values; TOLX is the convergence criterion
on $\delta\mathbf{x}$; STPMX is the scaled maximum step length allowed in line searches; TOLMIN is used to
decide whether spurious convergence to a minimum of fmin has occurred.

#define FREERETURN {free_vector(fvec,1,n);free_vector(xold,1,n);\
    free_vector(w,1,n);free_vector(t,1,n);free_vector(s,1,n);\
    free_matrix(r,1,n,1,n);free_matrix(qt,1,n,1,n);free_vector(p,1,n);\
    free_vector(g,1,n);free_vector(fvcold,1,n);free_vector(d,1,n);\
    free_vector(c,1,n);return;}


int nn;                                 /* Global variables to communicate with fmin. */
float *fvec;
void (*nrfuncv)(int n, float v[], float f[]);

void broydn(float x[], int n, int *check,
    void (*vecfunc)(int, float [], float []))

Given an initial guess x[1..n] for a root in n dimensions, find the root by Broyden’s method
embedded in a globally convergent strategy. The vector of functions to be zeroed, called
fvec[1..n] in the routine below, is returned by the user-supplied routine vecfunc(n,x,fvec).
The routine fdjac and the function fmin from newt are used. The output quantity check
is false (0) on a normal return and true (1) if the routine has converged to a local minimum
of the function fmin or if Broyden’s method can make no further progress. In this case try
restarting from a different initial guess.

{
    void fdjac(int n, float x[], float fvec[], float **df,
        void (*vecfunc)(int, float [], float []));
    float fmin(float x[]);
    void lnsrch(int n, float xold[], float fold, float g[], float p[], float x[],
        float *f, float stpmax, int *check, float (*func)(float []));
    void qrdcmp(float **a, int n, float *c, float *d, int *sing);
    void qrupdt(float **r, float **qt, int n, float u[], float v[]);
    void rsolv(float **a, int n, float d[], float b[]);
    int i,its,j,k,restrt,sing,skip;
    float den,f,fold,stpmax,sum,temp,test,*c,*d,*fvcold;
    float *g,*p,**qt,**r,*s,*t,*w,*xold;

    c=vector(1,n);
    d=vector(1,n);
    fvcold=vector(1,n);
    g=vector(1,n);
    p=vector(1,n);
    qt=matrix(1,n,1,n);
    r=matrix(1,n,1,n);
    s=vector(1,n);
    t=vector(1,n);
    w=vector(1,n);
    xold=vector(1,n);
    fvec=vector(1,n);                   /* Define global variables. */
    nn=n;
    nrfuncv=vecfunc;
    f=fmin(x);                          /* The vector fvec is also computed by this call. */
    test=0.0;
    /* Test for initial guess being a root. Use more stringent test than simply TOLF. */
    for (i=1;i<=n;i++)
        if (fabs(fvec[i]) > test) test=fabs(fvec[i]);
    if (test < 0.01*TOLF) {
        *check=0;
        FREERETURN
    }
    for (sum=0.0,i=1;i<=n;i++) sum += SQR(x[i]);    /* Calculate stpmax for line searches. */
    stpmax=STPMX*FMAX(sqrt(sum),(float)n);
    restrt=1;                           /* Ensure initial Jacobian gets computed. */

    for (its=1;its<=MAXITS;its++) {     /* Start of iteration loop. */
        if (restrt) {
            fdjac(n,x,fvec,r,vecfunc);  /* Initialize or reinitialize Jacobian in r. */
            qrdcmp(r,n,c,d,&sing);      /* QR decomposition of Jacobian. */
            if (sing) nrerror("singular Jacobian in broydn");
            for (i=1;i<=n;i++) {        /* Form Q^T explicitly. */
                for (j=1;j<=n;j++) qt[i][j]=0.0;
                qt[i][i]=1.0;
            }
            for (k=1;k<n;k++) {
                if (c[k]) {
                    for (j=1;j<=n;j++) {
                        sum=0.0;
                        for (i=k;i<=n;i++)
                            sum += r[i][k]*qt[i][j];
                        sum /= c[k];
                        for (i=k;i<=n;i++)
                            qt[i][j] -= sum*r[i][k];
                    }
                }
            }
            for (i=1;i<=n;i++) {        /* Form R explicitly. */
                r[i][i]=d[i];
                for (j=1;j<i;j++) r[i][j]=0.0;
            }

        } else {                        /* Carry out Broyden update. */
            for (i=1;i<=n;i++) s[i]=x[i]-xold[i];   /* s = delta x. */
            for (i=1;i<=n;i++) {        /* t = R . s. */
                for (sum=0.0,j=i;j<=n;j++) sum += r[i][j]*s[j];
                t[i]=sum;
            }
            skip=1;
            for (i=1;i<=n;i++) {        /* w = delta F - B . s. */
                for (sum=0.0,j=1;j<=n;j++) sum += qt[j][i]*t[j];
                w[i]=fvec[i]-fvcold[i]-sum;
                if (fabs(w[i]) >= EPS*(fabs(fvec[i])+fabs(fvcold[i]))) skip=0;
                else w[i]=0.0;          /* Don't update with noisy components of w. */
            }
            if (!skip) {
                for (i=1;i<=n;i++) {    /* t = Q^T . w. */
                    for (sum=0.0,j=1;j<=n;j++) sum += qt[i][j]*w[j];
                    t[i]=sum;
                }
                for (den=0.0,i=1;i<=n;i++) den += SQR(s[i]);
                for (i=1;i<=n;i++) s[i] /= den;     /* Store s/(s . s) in s. */
                qrupdt(r,qt,n,t,s);     /* Update R and Q^T. */
                for (i=1;i<=n;i++) {
                    if (r[i][i] == 0.0) nrerror("r singular in broydn");
                    d[i]=r[i][i];       /* Diagonal of R stored in d. */
                }
            }
        }
        for (i=1;i<=n;i++) {            /* Right-hand side for linear equations is -Q^T . F. */
            for (sum=0.0,j=1;j<=n;j++) sum += qt[i][j]*fvec[j];
            p[i] = -sum;
        }
        for (i=n;i>=1;i--) {            /* Compute grad f ~ (Q . R)^T . F for the line search. */
            for (sum=0.0,j=1;j<=i;j++) sum -= r[j][i]*p[j];
            g[i]=sum;
        }
        for (i=1;i<=n;i++) {            /* Store x and F. */
            xold[i]=x[i];
            fvcold[i]=fvec[i];
        }
        fold=f;                         /* Store f. */
        rsolv(r,n,d,p);                 /* Solve linear equations. */
        lnsrch(n,xold,fold,g,p,x,&f,stpmax,check,fmin);

        /* lnsrch returns new x and f. It also calculates fvec at the new x when it calls fmin. */
        test=0.0;                       /* Test for convergence on function values. */
        for (i=1;i<=n;i++)
            if (fabs(fvec[i]) > test) test=fabs(fvec[i]);
        if (test < TOLF) {
            *check=0;
            FREERETURN
        }
        if (*check) {                   /* True if line search failed to find a new x. */
            if (restrt) FREERETURN      /* Failure; already tried reinitializing the Jacobian. */
            else {
                test=0.0;               /* Check for gradient of f zero, i.e., spurious convergence. */
                den=FMAX(f,0.5*n);
                for (i=1;i<=n;i++) {
                    temp=fabs(g[i])*FMAX(fabs(x[i]),1.0)/den;
                    if (temp > test) test=temp;
                }
                if (test < TOLMIN) FREERETURN
                else restrt=1;          /* Try reinitializing the Jacobian. */
            }
        } else {                        /* Successful step; will use Broyden update for next step. */
            restrt=0;
            test=0.0;                   /* Test for convergence on delta x. */
            for (i=1;i<=n;i++) {
                temp=(fabs(x[i]-xold[i]))/FMAX(fabs(x[i]),1.0);
                if (temp > test) test=temp;
            }
            if (test < TOLX) FREERETURN
        }
    }
    nrerror("MAXITS exceeded in broydn");
    FREERETURN
}

More Advanced Implementations

One of the principal ways that the methods described so far can fail is if J (in Newton’s
method) or B (in Broyden’s method) becomes singular or nearly singular, so that $\delta\mathbf{x}$ cannot
be determined. If you are lucky, this situation will not occur very often in practice. Methods
developed so far to deal with this problem involve monitoring the condition number of J and
perturbing J if singularity or near singularity is detected. This is most easily implemented
if the QR decomposition is used instead of LU in Newton’s method (see [1] for details).
Our personal experience is that, while such an algorithm can solve problems where J is 
exactly singular and the standard Newton’s method fails, it is occasionally less robust on 
other problems where LU decomposition succeeds. Clearly implementation details involving 
roundoff, underflow, etc., are important here and the last word is yet to be written. 

Our global strategies both for minimization and zero finding have been based on line 
searches. Other global algorithms, such as the hook step and dogleg step methods, are based 
instead on the model-trust region approach, which is related to the Levenberg-Marquardt 
algorithm for nonlinear least-squares (§15.5). While somewhat more complicated than line 
searches, these methods have a reputation for robustness even when starting far from the 
desired zero or minimum [1].

Chapter 10. Minimization or Maximization of Functions

10.0 Introduction 


In a nutshell: You are given a single function $f$ that depends on one or more
independent variables. You want to find the value of those variables where $f$ takes
on a maximum or a minimum value. You can then calculate what value of $f$ is
achieved at the maximum or minimum. The tasks of maximization and minimization
are trivially related to each other, since one person’s function $f$ could just as well
be another’s $-f$. The computational desiderata are the usual ones: Do it quickly,
cheaply, and in small memory. Often the computational effort is dominated by
the cost of evaluating $f$ (and also perhaps its partial derivatives with respect to all
variables, if the chosen algorithm requires them). In such cases the desiderata are
sometimes replaced by the simple surrogate: Evaluate $f$ as few times as possible.

An extremum (maximum or minimum point) can be either global (truly 
the highest or lowest function value) or local (the highest or lowest in a finite 
neighborhood and not on the boundary of that neighborhood). (See Figure 10.0.1.) 
Finding a global extremum is, in general, a very difficult problem. Two standard 
heuristics are widely used: (i) find local extrema starting from widely varying 
starting values of the independent variables (perhaps chosen quasi-randomly, as in 
§7.7), and then pick the most extreme of these (if they are not all the same); or 
(ii) perturb a local extremum by taking a finite amplitude step away from it, and 
then see if your routine returns you to a better point, or “always” to the same 
one. Relatively recently, so-called “simulated annealing methods” (§10.9) have 
demonstrated important successes on a variety of global extremization problems. 

Our chapter title could just as well be optimization, which is the usual name 
for this very large field of numerical research. The importance ascribed to the 
various tasks in this field depends strongly on the particular interests of whom 
you talk to. Economists, and some engineers, are particularly concerned with 
constrained optimization, where there are a priori limitations on the allowed values 
of independent variables. For example, the production of wheat in the U.S. must 
be a nonnegative number. One particularly well-developed area of constrained 
optimization is linear programming, where both the function to be optimized and 
the constraints happen to be linear functions of the independent variables. Section 
10.8, which is otherwise somewhat disconnected from the rest of the material that we 
have chosen to include in this chapter, implements the so-called “simplex algorithm” 
for linear programming problems. 

Figure 10.0.1. Extrema of a function in an interval. Points A, C, and E are local, but not global
maxima. Points B and F are local, but not global minima. The global maximum occurs at G, which 
is on the boundary of the interval so that the derivative of the function need not vanish there. The 
global minimum is at D. At point E, derivatives higher than the first vanish, a situation which can 
cause difficulty for some algorithms. The points X, Y, and Z are said to “bracket” the minimum F, 
since Y is less than both X and Z. 

One other section, §10.9, also lies outside of our main thrust, but for a different
reason: so-called “annealing methods” are relatively new, so we do not yet know 
where they will ultimately fit into the scheme of things. However, these methods 
have solved some problems previously thought to be practically insoluble; they 
address directly the problem of finding global extrema in the presence of large 
numbers of undesired local extrema. 

The other sections in this chapter constitute a selection of the best established 
algorithms in unconstrained minimization. (For definiteness, we will henceforth 
regard the optimization problem as that of minimization.) These sections are 
connected, with later ones depending on earlier ones. If you are just looking for 
the one “perfect” algorithm to solve your particular application, you may feel that 
we are telling you more than you want to know. Unfortunately, there is no perfect 
optimization algorithm. This is a case where we strongly urge you to try more than 
one method in comparative fashion. Your initial choice of method can be based 
on the following considerations: 



• You must choose between methods that need only evaluations of the 
function to be minimized and methods that also require evaluations of the 
derivative of that function. In the multidimensional case, this derivative 
is the gradient, a vector quantity. Algorithms using the derivative are 
somewhat more powerful than those using only the function, but not 
always enough so as to compensate for the additional calculations of 
derivatives. We can easily construct examples favoring one approach or 
favoring the other. However, if you can compute derivatives, be prepared 
to try using them. 

• For one-dimensional minimization (minimize a function of one variable) 
without calculation of the derivative, bracket the minimum as described in 
§10.1, and then use Brent’s method as described in §10.2. If your function 
has a discontinuous second (or lower) derivative, then the parabolic 
interpolations of Brent’s method are of no advantage, and you might wish
to use the simplest form of golden section search, as described in §10.1. 

• For one-dimensional minimization with calculation of the derivative, §10.3
supplies a variant of Brent’s method which makes limited use of the
first derivative information. We shy away from the alternative of using
derivative information to construct high-order interpolating polynomials.
In our experience the improvement in convergence very near a smooth,
analytic minimum does not make up for the tendency of polynomials
sometimes to give wildly wrong interpolations at early stages, especially
for functions that may have sharp, “exponential” features.

We now turn to the multidimensional case, both with and without computation 
of first derivatives. 

• You must choose between methods that require storage of order $N^2$ and
those that require only of order $N$, where $N$ is the number of dimensions.
For moderate values of N and reasonable memory sizes this is not a 
serious constraint. There will be, however, the occasional application 
where storage may be critical. 

• We give in §10.4 a sometimes overlooked downhill simplex method due 
to Nelder and Mead. (This use of the word “simplex” is not to be 
confused with the simplex method of linear programming.) This method 
just crawls downhill in a straightforward fashion that makes almost no 
special assumptions about your function. This can be extremely slow, but 
it can also, in some cases, be extremely robust. Not to be overlooked is 
the fact that the code is concise and completely self-contained: a general
$N$-dimensional minimization program in under 100 program lines! This
method is most useful when the minimization calculation is only an
incidental part of your overall problem. The storage requirement is of
order $N^2$, and derivative calculations are not required.

• Section 10.5 deals with direction-set methods, of which Powell’s method 
is the prototype. These are the methods of choice when you cannot easily 
calculate derivatives, and are not necessarily to be sneered at even if you 
can. Although derivatives are not needed, the method does require a 
one-dimensional minimization sub-algorithm such as Brent’s method (see 
above). Storage is of order $N^2$.

There are two major families of algorithms for multidimensional minimization 
with calculation of first derivatives. Both families require a one-dimensional 
minimization sub-algorithm, which can itself either use, or not use, the derivative 
information, as you see fit (depending on the relative effort of computing the function 
and of its gradient vector). We do not think that either family dominates the other in 
all applications; you should think of them as available alternatives: 

• The first family goes under the name conjugate gradient methods, as typified
by the Fletcher-Reeves algorithm and the closely related and probably
superior Polak-Ribiere algorithm. Conjugate gradient methods require
only of order a few times $N$ storage, require derivative calculations and
one-dimensional sub-minimization. Turn to §10.6 for detailed discussion
and implementation.

• The second family goes under the names quasi-Newton or variable metric 
methods, as typified by the Davidon-Fletcher-Powell (DFP) algorithm 
(sometimes referred to just as Fletcher-Powell) or the closely related 
Broyden-Fletcher-Goldfarb-Shanno (BFGS) algorithm. These methods
require of order $N^2$ storage, require derivative calculations and
one-dimensional sub-minimization. Details are in §10.7.

You are now ready to proceed with scaling the peaks (and/or plumbing the 
depths) of practical optimization. 


CITED REFERENCES AND FURTHER READING: 

Dennis, J.E., and Schnabel, R.B. 1983, Numerical Methods for Unconstrained Optimization and 
Nonlinear Equations (Englewood Cliffs, NJ: Prentice-Hall). 

Polak, E. 1971, Computational Methods in Optimization (New York: Academic Press). 

Gill, P.E., Murray, W., and Wright, M.H. 1981, Practical Optimization (New York: Academic Press). 

Acton, F.S. 1970, Numerical Methods That Work; 1990, corrected edition (Washington:
Mathematical Association of America), Chapter 17.

Jacobs, D.A.H. (ed.) 1977, The State of the Art in Numerical Analysis (London: Academic
Press), Chapter III.1.

Brent, R.P. 1973, Algorithms for Minimization without Derivatives (Englewood Cliffs, NJ:
Prentice-Hall).

Dahlquist, G., and Björck, A. 1974, Numerical Methods (Englewood Cliffs, NJ: Prentice-Hall),
Chapter 10.


10.1 Golden Section Search in One Dimension 

Recall how the bisection method finds roots of functions in one dimension 
(§9.1): The root is supposed to have been bracketed in an interval (a, b). One 
then evaluates the function at an intermediate point x and obtains a new, smaller 
bracketing interval, either (a, x) or (x, b). The process continues until the bracketing
interval is acceptably small. It is optimal to choose x to be the midpoint of (a, b) 
so that the decrease in the interval length is maximized when the function is as 
uncooperative as it can be, i.e., when the luck of the draw forces you to take the 
bigger bisected segment. 

There is a precise, though slightly subtle, translation of these considerations to 
the minimization problem: What does it mean to bracket a minimum? A root of a 
function is known to be bracketed by a pair of points, a and b, when the function
has opposite sign at those two points. A minimum, by contrast, is known to be
bracketed only when there is a triplet of points, a < b < c (or c < b < a), such that
f(b) is less than both f(a) and f(c). In this case we know that the function (if it
is nonsingular) has a minimum in the interval (a, c).

The analog of bisection is to choose a new point x, either between a and b or 
between b and c. Suppose, to be specific, that we make the latter choice. Then we 
evaluate f(x). If f(b) < f(x), then the new bracketing triplet of points is (a, b, x); 

Figure 10.1.1. Successive bracketing of a minimum. The minimum is originally bracketed by points
1,3,2. The function is evaluated at 4, which replaces 2; then at 5, which replaces 1; then at 6, which 
replaces 4. The rule at each stage is to keep a center point that is lower than the two outside points. After 
the steps shown, the minimum is bracketed by points 5,3,6. 

contrariwise, if f(b) > f(x), then the new bracketing triplet is (b, x, c). In all cases 
the middle point of the new triplet is the abscissa whose ordinate is the best minimum 
achieved so far; see Figure 10.1.1. We continue the process of bracketing until the 
distance between the two outer points of the triplet is tolerably small. 

How small is “tolerably” small? For a minimum located at a value $b$, you
might naively think that you will be able to bracket it in as small a range as
$(1 - \epsilon)b < b < (1 + \epsilon)b$, where $\epsilon$ is your computer’s floating-point precision, a
number like $3 \times 10^{-8}$ (for float) or $10^{-15}$ (for double). Not so! In general, the
shape of your function $f(x)$ near $b$ will be given by Taylor’s theorem

$f(x) \approx f(b) + \frac{1}{2} f''(b)(x - b)^2$ (10.1.1)

The second term will be negligible compared to the first (that is, will be a factor $\epsilon$
smaller and will act just like zero when added to it) whenever

$|x - b| < \sqrt{\epsilon}\,|b|\,\sqrt{\frac{2|f(b)|}{b^2 f''(b)}}$ (10.1.2)

The reason for writing the right-hand side in this way is that, for most functions,
the final square root is a number of order unity. Therefore, as a rule of thumb, it
is hopeless to ask for a bracketing interval of width less than $\sqrt{\epsilon}$ times its central
value, a fractional width of only about $10^{-4}$ (single precision) or $3 \times 10^{-8}$ (double
precision). Knowing this inescapable fact will save you a lot of useless bisections!
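The effect is easy to observe directly. The sketch below is an illustration under stated assumptions, not one of the book's routines: it works in double precision, uses a made-up test function with its minimum $f = 1$ at $x = 1$, and takes the final square root of (10.1.2) to be of order unity.

```c
#include <math.h>
#include <float.h>

/* Hypothetical test function with a minimum f = 1 at x = 1. */
double shifted_parabola(double x)
{
    return 1.0 + (x - 1.0) * (x - 1.0);
}

/* Rule-of-thumb smallest useful distance from a minimum at b, assuming
   the final square root in equation (10.1.2) is of order unity. */
double min_useful_width(double b)
{
    return sqrt(DBL_EPSILON) * fabs(b);
}
```

With IEEE double arithmetic, min_useful_width(1.0) is about 1.5e-8. A displacement of 1e-9 from the minimum is below that threshold, so shifted_parabola returns exactly the same value as at the minimum itself, while a displacement of 1e-7 produces a detectably larger value. No amount of further bracketing can distinguish points inside the smaller range.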

The minimum-finding routines of this chapter will often call for a user-supplied
argument tol, and return with an abscissa whose fractional precision is about ±tol
(bracketing interval of fractional size about 2×tol). Unless you have a better
estimate for the right-hand side of equation (10.1.2), you should set tol equal to
(not much less than) the square root of your machine’s floating-point precision, since
smaller values will gain you nothing.

It remains to decide on a strategy for choosing the new point $x$, given $(a, b, c)$.
Suppose that $b$ is a fraction $w$ of the way between $a$ and $c$, i.e.,

$\frac{b - a}{c - a} = w, \qquad \frac{c - b}{c - a} = 1 - w$ (10.1.3)

Also suppose that our next trial point $x$ is an additional fraction $z$ beyond $b$,

$\frac{x - b}{c - a} = z$ (10.1.4)

Then the next bracketing segment will either be of length $w + z$ relative to the current
one, or else of length $1 - w$. If we want to minimize the worst case possibility, then
we will choose $z$ to make these equal, namely

$z = 1 - 2w$ (10.1.5)

We see at once that the new point is the symmetric point to $b$ in the original interval,
namely with $|b - a|$ equal to $|x - c|$. This implies that the point $x$ lies in the larger
of the two segments ($z$ is positive only if $w < 1/2$).

But where in the larger segment? Where did the value of $w$ itself come from?
Presumably from the previous stage of applying our same strategy. Therefore, if $z$
is chosen to be optimal, then so was $w$ before it. This scale similarity implies that
$x$ should be the same fraction of the way from $b$ to $c$ (if that is the bigger segment)
as was $b$ from $a$ to $c$, in other words,

$\frac{z}{1 - w} = w$ (10.1.6)

Equations (10.1.5) and (10.1.6) give the quadratic equation

$w^2 - 3w + 1 = 0 \quad \text{yielding} \quad w = \frac{3 - \sqrt{5}}{2} \approx 0.38197$ (10.1.7)
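The algebra can be confirmed numerically with a few lines of C. This is only a sketch; the function names golden_fraction and next_offset are hypothetical and introduced here for illustration.

```c
#include <math.h>

/* Smaller root of w^2 - 3w + 1 = 0, equation (10.1.7). */
double golden_fraction(void)
{
    return (3.0 - sqrt(5.0)) / 2.0;
}

/* Worst-case-optimal offset of the next trial point for a given
   fraction w (0 < w < 1/2), equation (10.1.5). */
double next_offset(double w)
{
    return 1.0 - 2.0 * w;
}
```

With w = golden_fraction() and z = next_offset(w), the two candidate segment lengths w + z and 1 - w agree, and the ratio z/(1 - w) reproduces w, which is the self-similarity condition that led to the quadratic.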

In other words, the optimal bracketing interval (a, b, c) has its middle point b a
fractional distance 0.38197 from one end (say, a), and 0.61803 from the other end
(say, c). These fractions are those of the so-called golden mean or golden section,
whose supposedly aesthetic properties hark back to the ancient Pythagoreans. This 
optimal method of function minimization, the analog of the bisection method for 
finding zeros, is thus called the golden section search, summarized as follows: 

Given, at each stage, a bracketing triplet of points, the next point to be tried 
is that which is a fraction 0.38197 into the larger of the two intervals (measuring 
from the central point of the triplet). If you start out with a bracketing triplet whose 
segments are not in the golden ratios, the procedure of choosing successive points 
at the golden mean point of the larger segment will quickly converge you to the 
proper, self-replicating ratios. 

The golden section search guarantees that each new function evaluation will 
(after self-replicating ratios have been achieved) bracket the minimum to an interval 
just 0.61803 times the size of the preceding interval. This is comparable to, but not
quite as good as, the 0.50000 that holds when finding roots by bisection. Note that 
the convergence is linear (in the language of Chapter 9), meaning that successive 
significant figures are won linearly with additional function evaluations. In the 
next section we will give a superlinear method, where the rate at which successive 
significant figures are liberated increases with each successive function evaluation. 

Routine for Initially Bracketing a Minimum 

The preceding discussion has assumed that you are able to bracket the minimum 
in the first place. We consider this initial bracketing to be an essential part of any 
one-dimensional minimization. There are some one-dimensional algorithms that 
do not require a rigorous initial bracketing. However, we would never trade the 
secure feeling of knowing that a minimum is “in there somewhere” for the dubious 
reduction of function evaluations that these nonbracketing routines may promise. 
Please bracket your minima (or, for that matter, your zeros) before isolating them! 

There is not much theory as to how to do this bracketing. Obviously you want 
to step downhill. But how far? We like to take larger and larger steps, starting with 
some (wild?) initial guess and then increasing the stepsize at each step either by 
a constant factor, or else by the result of a parabolic extrapolation of the preceding 
points that is designed to take us to the extrapolated turning point. It doesn’t much 
matter if the steps get big. After all, we are stepping downhill, so we already have 
the left and middle points of the bracketing triplet. We just need to take a big enough 
step to stop the downhill trend and get a high third point. 

Our standard routine is this: 

#include <math.h>
#include "nrutil.h"
#define GOLD 1.618034
#define GLIMIT 100.0
#define TINY 1.0e-20
#define SHFT(a,b,c,d) (a)=(b);(b)=(c);(c)=(d);

Here GOLD is the default ratio by which successive intervals are magnified; GLIMIT is the
maximum magnification allowed for a parabolic-fit step.

void mnbrak(float *ax, float *bx, float *cx, float *fa, float *fb, float *fc,
    float (*func)(float))

Given a function func, and given distinct initial points ax and bx, this routine searches in
the downhill direction (defined by the function as evaluated at the initial points) and returns
new points ax, bx, cx that bracket a minimum of the function. Also returned are the function
values at the three points, fa, fb, and fc.

{
    float ulim,u,r,q,fu,dum;

    *fa=(*func)(*ax);
    *fb=(*func)(*bx);
    if (*fb > *fa) {                    /* Switch roles of a and b so that we can go */
        SHFT(dum,*ax,*bx,dum)           /* downhill in the direction from a to b. */
        SHFT(dum,*fb,*fa,dum)
    }
    *cx=(*bx)+GOLD*(*bx-*ax);           /* First guess for c. */
    *fc=(*func)(*cx);
    while (*fb > *fc) {                 /* Keep returning here until we bracket. */
        r=(*bx-*ax)*(*fb-*fc);          /* Compute u by parabolic extrapolation from a, b, c. */
        q=(*bx-*cx)*(*fb-*fa);          /* TINY is used to prevent any possible division by zero. */
        u=(*bx)-((*bx-*cx)*q-(*bx-*ax)*r)/
            (2.0*SIGN(FMAX(fabs(q-r),TINY),q-r));
        ulim=(*bx)+GLIMIT*(*cx-*bx);
        /* We won't go farther than this. Test various possibilities: */
        if ((*bx-u)*(u-*cx) > 0.0) {    /* Parabolic u is between b and c: try it. */
            fu=(*func)(u);
            if (fu < *fc) {             /* Got a minimum between b and c. */
                *ax=(*bx);
                *bx=u;
                *fa=(*fb);
                *fb=fu;
                return;
            } else if (fu > *fb) {      /* Got a minimum between a and u. */
                *cx=u;
                *fc=fu;
                return;
            }
            u=(*cx)+GOLD*(*cx-*bx);     /* Parabolic fit was no use. Use default magnification. */
            fu=(*func)(u);
        } else if ((*cx-u)*(u-ulim) > 0.0) {    /* Parabolic fit is between c and its allowed limit. */
            fu=(*func)(u);
            if (fu < *fc) {
                SHFT(*bx,*cx,u,*cx+GOLD*(*cx-*bx))
                SHFT(*fb,*fc,fu,(*func)(u))
            }
        } else if ((u-ulim)*(ulim-*cx) >= 0.0) {    /* Limit parabolic u to maximum allowed value. */
            u=ulim;
            fu=(*func)(u);
        } else {                        /* Reject parabolic u, use default magnification. */
            u=(*cx)+GOLD*(*cx-*bx);
            fu=(*func)(u);
        }
        SHFT(*ax,*bx,*cx,u)             /* Eliminate oldest point and continue. */
        SHFT(*fa,*fb,*fc,fu)
    }
}
(Because of the housekeeping involved in moving around three or four points and 
their function values, the above program ends up looking deceptively formidable. 
That is true of several other programs in this chapter as well. The underlying ideas, 
however, are quite simple.) 


Routine for Golden Section Search 


#include <math.h>
#define R 0.61803399                    /* The golden ratios. */
#define C (1.0-R)
#define SHFT2(a,b,c) (a)=(b);(b)=(c);
#define SHFT3(a,b,c,d) (a)=(b);(b)=(c);(c)=(d);

float golden(float ax, float bx, float cx, float (*f)(float), float tol,
    float *xmin)

Given a function f, and given a bracketing triplet of abscissas ax, bx, cx (such that bx is
between ax and cx, and f(bx) is less than both f(ax) and f(cx)), this routine performs a
golden section search for the minimum, isolating it to a fractional precision of about tol. The
abscissa of the minimum is returned as xmin, and the minimum function value is returned as
golden, the returned function value.

{
    float f1,f2,x0,x1,x2,x3;

    x0=ax;                              /* At any given time we will keep track of four */
    x3=cx;                              /* points, x0,x1,x2,x3. */
    if (fabs(cx-bx) > fabs(bx-ax)) {    /* Make x0 to x1 the smaller segment, */
        x1=bx;
        x2=bx+C*(cx-bx);                /* and fill in the new point to be tried. */
    } else {
        x2=bx;
        x1=bx-C*(bx-ax);
    }
    /* The initial function evaluations. Note that we never need to evaluate
       the function at the original endpoints. */
    f1=(*f)(x1);
    f2=(*f)(x2);
    while (fabs(x3-x0) > tol*(fabs(x1)+fabs(x2))) {
        if (f2 < f1) {                  /* One possible outcome, */
            SHFT3(x0,x1,x2,R*x1+C*x3)   /* its housekeeping, */
            SHFT2(f1,f2,(*f)(x2))       /* and a new function evaluation. */
        } else {                        /* The other outcome, */
            SHFT3(x3,x2,x1,R*x2+C*x0)
            SHFT2(f2,f1,(*f)(x1))       /* and its new function evaluation. */
        }
    }                                   /* Back to see if we are done. */
    if (f1 < f2) {                      /* We are done. Output the best of the two current values. */
        *xmin=x1;
        return f1;
    } else {
        *xmin=x2;
        return f2;
    }
}


10.2 Parabolic Interpolation and Brent’s Method 
in One Dimension 



We already tipped our hand about the desirability of parabolic interpolation in 
the previous section’s mnbrak routine, but it is now time to be more explicit. A 
golden section search is designed to handle, in effect, the worst possible case of 
function minimization, with the uncooperative minimum hunted down and cornered 
like a scared rabbit. But why assume the worst? If the function is nicely parabolic 
near to the minimum — surely the generic case for sufficiently smooth functions — 
then the parabola fitted through any three points ought to take us in a single leap 
to the minimum, or at least very near to it (see Figure 10.2.1). Since we want to 
find an abscissa rather than an ordinate, the procedure is technically called inverse 
parabolic interpolation. 

The formula for the abscissa x that is the minimum of a parabola through three
points f(a), f(b), and f(c) is

$$x = b - \frac{1}{2}\,\frac{(b-a)^2\,[f(b)-f(c)] - (b-c)^2\,[f(b)-f(a)]}{(b-a)\,[f(b)-f(c)] - (b-c)\,[f(b)-f(a)]} \tag{10.2.1}$$

as you can easily derive. This formula fails only if the three points are collinear,
in which case the denominator is zero (minimum of the parabola is infinitely far
away). Note, however, that (10.2.1) is as happy jumping to a parabolic maximum
as to a minimum. No minimization scheme that depends solely on (10.2.1) is likely
to succeed in practice.

Figure 10.2.1. Convergence to a minimum by inverse parabolic interpolation. A parabola (dashed line) is
drawn through the three original points 1,2,3 on the given function (solid line). The function is evaluated
at the parabola’s minimum, 4, which replaces point 3. A new parabola (dotted line) is drawn through
points 1,4,2. The minimum of this parabola is at 5, which is close to the minimum of the function.
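
As an illustration of (10.2.1), here is a minimal sketch of a single interpolation step. The function name parabolic_step and the use of double precision are our assumptions for the example; it is not one of the book's routines. Note that it reports the collinear case instead of dividing by zero, and that it will cheerfully return the abscissa of a maximum if the three points bend that way.

```c
#include <assert.h>
#include <stddef.h>

/* One step of inverse parabolic interpolation, equation (10.2.1).
   Inputs are three abscissas a, b, c and the function values fa, fb, fc.
   Returns 1 and stores the abscissa of the parabola's extremum in *x,
   or returns 0 when the three points are collinear (zero denominator).
   Hypothetical helper, written only to illustrate the formula. */
int parabolic_step(double a, double fa, double b, double fb,
                   double c, double fc, double *x)
{
    double r, q, num, den;

    assert(x != NULL);
    r = (b - a) * (fb - fc);
    q = (b - c) * (fb - fa);
    num = (b - a) * r - (b - c) * q;    /* numerator of (10.2.1)   */
    den = r - q;                        /* denominator of (10.2.1) */
    if (den == 0.0) return 0;           /* collinear points: no extremum */
    *x = b - 0.5 * num / den;
    return 1;
}
```

For an exactly parabolic function such as f(x) = (x - 2)^2, any three distinct abscissas give the minimum at x = 2 in one call, which is the "single leap" described above.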

The exacting task is to invent a scheme that relies on a sure-but-slow technique, 
like golden section search, when the function is not cooperative, but that switches 
over to (10.2.1) when the function allows. The task is nontrivial for several 
reasons, including these: (i) The housekeeping needed to avoid unnecessary function 
evaluations in switching between the two methods can be complicated, (ii) Careful 
attention must be given to the “endgame,” where the function is being evaluated 
very near to the roundoff limit of equation (10.1.2). (iii) The scheme for detecting a 
cooperative versus noncooperative function must be very robust. 

Brent’s method [1] is up to the task in all particulars. At any particular stage,
it is keeping track of six function points (not necessarily all distinct), a, b, u, v,
w and x, defined as follows: the minimum is bracketed between a and b; x is the
point with the very least function value found so far (or the most recent one in
case of a tie); w is the point with the second least function value; v is the previous
value of w; u is the point at which the function was evaluated most recently. Also
appearing in the algorithm is the point $x_m$, the midpoint between a and b; however,
the function is not evaluated there.

You can read the code below to understand the method’s logical organization. 
Mention of a few general principles here may, however, be helpful: Parabolic 
interpolation is attempted, fitting through the points x, v, and w. To be acceptable, 
the parabolic step must (i) fall within the bounding interval (a, b), and (ii) imply a 
movement from the best current value x that is less than half the movement of the 
step before last. This second criterion insures that the parabolic steps are actually 
converging to something, rather than, say, bouncing around in some nonconvergent 
limit cycle. In the worst possible case, where the parabolic steps are acceptable but
useless, the method will approximately alternate between parabolic steps and golden
sections, converging in due course by virtue of the latter. The reason for comparing 
to the step before last seems essentially heuristic: Experience shows that it is better 
not to “punish” the algorithm for a single bad step if it can make it up on the next one. 
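
The two acceptance tests can be sketched as a small predicate. This is only an illustration under our own naming (parabolic_step_acceptable, e_before_last); the routine brent below applies the same tests to the numerator p and denominator q of the step rather than to the displacement d = p/q itself.

```c
#include <assert.h>
#include <math.h>

/* Returns 1 if a proposed parabolic displacement d from the best point x
   passes both tests described in the text, 0 otherwise:
   (i)  x + d falls strictly inside the bracket (a, b);
   (ii) |d| is less than half the displacement of the step before last.
   Hypothetical helper for illustration only. */
int parabolic_step_acceptable(double d, double e_before_last,
                              double a, double b, double x)
{
    assert(a <= x && x <= b);                             /* x lies in the bracket */
    if (fabs(d) >= 0.5 * fabs(e_before_last)) return 0;   /* test (ii) fails */
    if (d <= a - x || d >= b - x) return 0;               /* test (i) fails  */
    return 1;
}
```

When the predicate returns 0, the method falls back on a golden section step into the larger of the two segments.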

Another principle exemplified in the code is never to evaluate the function less 
than a distance tol from a point already evaluated (or from a known bracketing 
point). The reason is that, as we saw in equation (10.1.2), there is simply no 
information content in doing so: the function will differ from the value already 
evaluated only by an amount of order the roundoff error. Therefore in the code below 
you will find several tests and modifications of a potential new point, imposing this 
restriction. This restriction also interacts subtly with the test for “doneness,” which 
the method takes into account. 
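
The minimum-distance rule amounts to one line of arithmetic, sketched here. The name next_trial_point is hypothetical, and tol1 stands for the absolute tolerance tol*|x| + ZEPS that the routine below computes on each iteration.

```c
#include <assert.h>
#include <math.h>

/* Next evaluation abscissa: x + d, except that the new point is never
   allowed to lie closer to x than tol1. A too-small step keeps its
   direction but is stretched to length tol1.
   Hypothetical helper for illustration only. */
double next_trial_point(double x, double d, double tol1)
{
    assert(tol1 > 0.0);
    if (fabs(d) >= tol1) return x + d;
    return x + (d >= 0.0 ? tol1 : -tol1);
}
```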

A typical ending configuration for Brent’s method is that a and b are 2 × x × tol
apart, with x (the best abscissa) at the midpoint of a and b, and therefore fractionally 
accurate to ±tol. 

Indulge us a final reminder that tol should generally be no smaller than the 
square root of your machine’s floating-point precision. 
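
If you would rather compute that floor than remember it, the standard header float.h supplies the machine precision. The sketch below uses helper names of our own invention (min_tol_float, min_tol_double, clamp_tol_float); on typical IEEE hardware the two floors come out at roughly 3e-4 and 1.5e-8, but that is a property of the machine, not of the code.

```c
#include <assert.h>
#include <float.h>
#include <math.h>

/* Smallest sensible fractional tolerance: the square root of the
   machine precision, for single- and double-precision arithmetic.
   Hypothetical helpers for illustration only. */
double min_tol_float(void)  { return sqrt((double)FLT_EPSILON); }
double min_tol_double(void) { return sqrt(DBL_EPSILON); }

/* Raise a requested tolerance to the single-precision floor if needed. */
double clamp_tol_float(double tol)
{
    double floor_tol = min_tol_float();

    assert(tol > 0.0);
    return tol < floor_tol ? floor_tol : tol;
}
```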


#include <math.h>
#include "nrutil.h"

#define ITMAX 100
#define CGOLD 0.3819660
#define ZEPS 1.0e-10
/* Here ITMAX is the maximum allowed number of iterations; CGOLD is the golden ratio; ZEPS is
a small number that protects against trying to achieve fractional accuracy for a minimum that
happens to be exactly zero. */
#define SHFT(a,b,c,d) (a)=(b);(b)=(c);(c)=(d);

float brent(float ax, float bx, float cx, float (*f)(float), float tol,
    float *xmin)
/* Given a function f, and given a bracketing triplet of abscissas ax, bx, cx (such that bx is
between ax and cx, and f(bx) is less than both f(ax) and f(cx)), this routine isolates
the minimum to a fractional precision of about tol using Brent's method. The abscissa of
the minimum is returned as xmin, and the minimum function value is returned as brent, the
returned function value. */
{
    int iter;
    float a,b,d,etemp,fu,fv,fw,fx,p,q,r,tol1,tol2,u,v,w,x,xm;
    float e=0.0;                        /* This will be the distance moved on
                                           the step before last. */

    a=(ax < cx ? ax : cx);              /* a and b must be in ascending order, */
    b=(ax > cx ? ax : cx);              /* but input abscissas need not be. */
    x=w=v=bx;                           /* Initializations... */
    fw=fv=fx=(*f)(x);
    for (iter=1;iter<=ITMAX;iter++) {   /* Main program loop. */
        xm=0.5*(a+b);
        tol2=2.0*(tol1=tol*fabs(x)+ZEPS);
        if (fabs(x-xm) <= (tol2-0.5*(b-a))) {   /* Test for done here. */
            *xmin=x;
            return fx;
        }
        if (fabs(e) > tol1) {           /* Construct a trial parabolic fit. */
            r=(x-w)*(fx-fv);
            q=(x-v)*(fx-fw);
            p=(x-v)*q-(x-w)*r;
            q=2.0*(q-r);
            if (q > 0.0) p = -p;
            q=fabs(q);
            etemp=e;
            e=d;
            if (fabs(p) >= fabs(0.5*q*etemp) || p <= q*(a-x) || p >= q*(b-x))
                d=CGOLD*(e=(x >= xm ? a-x : b-x));
            /* The above conditions determine the acceptability of the parabolic fit. Here we
            take the golden section step into the larger of the two segments. */
            else {
                d=p/q;                  /* Take the parabolic step. */
                u=x+d;
                if (u-a < tol2 || b-u < tol2)
                    d=SIGN(tol1,xm-x);
            }
        } else {
            d=CGOLD*(e=(x >= xm ? a-x : b-x));
        }
        u=(fabs(d) >= tol1 ? x+d : x+SIGN(tol1,d));
        fu=(*f)(u);
        /* This is the one function evaluation per iteration. */
        if (fu <= fx) {                 /* Now decide what to do with our function
                                           evaluation. */
            if (u >= x) a=x; else b=x;
            SHFT(v,w,x,u)               /* Housekeeping follows: */
            SHFT(fv,fw,fx,fu)
        } else {
            if (u < x) a=u; else b=u;
            if (fu <= fw || w == x) {
                v=w;
                w=u;
                fv=fw;
                fw=fu;
            } else if (fu <= fv || v == x || v == w) {
                v=u;
                fv=fu;
            }
        }                               /* Done with housekeeping. Back for
                                           another iteration. */
    }
    nrerror("Too many iterations in brent");
    *xmin=x;                            /* Never get here. */
    return fx;
}


CITED REFERENCES AND FURTHER READING: 

Brent, R.P. 1973, Algorithms for Minimization without Derivatives (Englewood Cliffs, NJ:
Prentice-Hall), Chapter 5. [1]

Forsythe, G.E., Malcolm, M.A., and Moler, C.B. 1977, Computer Methods for Mathematical 
Computations (Englewood Cliffs, NJ: Prentice-Hall), §8.2. 


10.3 One-Dimensional Search with First 
Derivatives 



Here we want to accomplish precisely the same goal as in the previous 
section, namely to isolate a functional minimum that is bracketed by the triplet of 
abscissas (a, b, c), but utilizing an additional capability to compute the function’s
first derivative as well as its value.

In principle, we might simply search for a zero of the derivative, ignoring the
function value information, using a root finder like rtflsp or zbrent (§§9.2-9.3).
It doesn’t take long to reject that idea: How do we distinguish maxima from minima? 
Where do we go from initial conditions where the derivatives on one or both of 
the outer bracketing points indicate that “downhill” is in the direction out of the 
bracketed interval? 

We don’t want to give up our strategy of maintaining a rigorous bracket on the 
minimum at all times. The only way to keep such a bracket is to update it using 
function (not derivative) information, with the central point in the bracketing triplet 
always that with the lowest function value. Therefore the role of the derivatives can 
only be to help us choose new trial points within the bracket. 

One school of thought is to “use everything you’ve got”: Compute a polynomial 
of relatively high order (cubic or above) that agrees with some number of previous 
function and derivative evaluations. For example, there is a unique cubic that agrees 
with function and derivative at two points, and one can jump to the interpolated 
minimum of that cubic (if there is a minimum within the bracket). Suggested by 
Davidon and others, formulas for this tactic are given in [1].

We like to be more conservative than this. Once superlinear convergence sets 
in, it hardly matters whether its order is moderately lower or higher. In practical 
problems that we have met, most function evaluations are spent in getting globally 
close enough to the minimum for superlinear convergence to commence. So we are 
more worried about all the funny “stiff” things that high-order polynomials can do 
(cf. Figure 3.0.1b), and about their sensitivities to roundoff error. 

This leads us to use derivative information only as follows: The sign of the 
derivative at the central point of the bracketing triplet (a, b, c) indicates uniquely 
whether the next test point should be taken in the interval (a, b) or in the interval
(b, c). The value of this derivative and of the derivative at the second-best-so-far 
point are extrapolated to zero by the secant method (inverse linear interpolation), 
which by itself is superlinear of order 1.618. (The golden mean again: see [1], p. 57.) 
We impose the same sort of restrictions on this new trial point as in Brent’s method. 
If the trial point must be rejected, we bisect the interval under scrutiny. 
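
The secant extrapolation and its acceptance test can be sketched in a few lines. The names secant_step and step_ok are our own, chosen for the example; the routine dbrent below performs the same arithmetic inline for both the second-best point w and the previous point v.

```c
#include <assert.h>
#include <stddef.h>

/* Secant estimate of the displacement d from x to the zero of the
   derivative, given derivative values dx at x and dw at a second point w.
   Returns 0 if the two derivative values are equal (no estimate).
   Hypothetical helper for illustration only. */
int secant_step(double x, double dx, double w, double dw, double *d)
{
    assert(d != NULL);
    if (dx == dw) return 0;
    *d = (w - x) * dx / (dx - dw);
    return 1;
}

/* A proposed displacement d is usable only if x + d lies strictly inside
   the bracket (a, b) and on the downhill side of x, as indicated by the
   sign of the derivative dx at x. */
int step_ok(double a, double b, double x, double dx, double d)
{
    double u = x + d;

    return (a - u) * (u - b) > 0.0 && dx * d <= 0.0;
}
```

For a function whose derivative is exactly linear, such as f'(x) = 2(x - 2), one secant step from any two distinct points lands on the zero of the derivative.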

Yes, we are fuddy-duddies when it comes to making flamboyant use of derivative 
information in one-dimensional minimization. But we have met too many functions 
whose computed “derivatives” don’t integrate up to the function value and don’t 
accurately point the way to the minimum, usually because of roundoff errors, 
sometimes because of truncation error in the method of derivative evaluation. 

You will see that the following routine is closely modeled on brent in the 
previous section. 

#include <math.h>
#include "nrutil.h"

#define ITMAX 100
#define ZEPS 1.0e-10
#define MOV3(a,b,c, d,e,f) (a)=(d);(b)=(e);(c)=(f);

float dbrent(float ax, float bx, float cx, float (*f)(float),
    float (*df)(float), float tol, float *xmin)
/* Given a function f and its derivative function df, and given a bracketing triplet of abscissas ax,
bx, cx [such that bx is between ax and cx, and f(bx) is less than both f(ax) and f(cx)],
this routine isolates the minimum to a fractional precision of about tol using a modification of
Brent's method that uses derivatives. The abscissa of the minimum is returned as xmin, and
the minimum function value is returned as dbrent, the returned function value. */
{
    int iter,ok1,ok2;                   /* Will be used as flags for whether proposed
                                           steps are acceptable or not. */
    float a,b,d,d1,d2,du,dv,dw,dx,e=0.0;
    float fu,fv,fw,fx,olde,tol1,tol2,u,u1,u2,v,w,x,xm;

    /* Comments following will point out only differences from the routine brent. Read that
    routine first. */
    a=(ax < cx ? ax : cx);
    b=(ax > cx ? ax : cx);
    x=w=v=bx;
    fw=fv=fx=(*f)(x);
    dw=dv=dx=(*df)(x);                  /* All our housekeeping chores are doubled
                                           by the necessity of moving derivative
                                           values around as well as function values. */
    for (iter=1;iter<=ITMAX;iter++) {
        xm=0.5*(a+b);
        tol1=tol*fabs(x)+ZEPS;
        tol2=2.0*tol1;
        if (fabs(x-xm) <= (tol2-0.5*(b-a))) {
            *xmin=x;
            return fx;
        }
        if (fabs(e) > tol1) {
            d1=2.0*(b-a);               /* Initialize these d's to an out-of-bracket
                                           value. */
            d2=d1;
            if (dw != dx) d1=(w-x)*dx/(dx-dw);  /* Secant method with one point. */
            if (dv != dx) d2=(v-x)*dx/(dx-dv);  /* And the other. */
            /* Which of these two estimates of d shall we take? We will insist that they be within
            the bracket, and on the side pointed to by the derivative at x: */
            u1=x+d1;
            u2=x+d2;
            ok1 = (a-u1)*(u1-b) > 0.0 && dx*d1 <= 0.0;
            ok2 = (a-u2)*(u2-b) > 0.0 && dx*d2 <= 0.0;
            olde=e;                     /* Movement on the step before last. */
            e=d;
            if (ok1 || ok2) {           /* Take only an acceptable d, and if
                                           both are acceptable, then take
                                           the smallest one. */
                if (ok1 && ok2)
                    d=(fabs(d1) < fabs(d2) ? d1 : d2);
                else if (ok1)
                    d=d1;
                else
                    d=d2;
                if (fabs(d) <= fabs(0.5*olde)) {
                    u=x+d;
                    if (u-a < tol2 || b-u < tol2)
                        d=SIGN(tol1,xm-x);
                } else {                /* Bisect, not golden section. */
                    d=0.5*(e=(dx >= 0.0 ? a-x : b-x));
                    /* Decide which segment by the sign of the derivative. */
                }
            } else {
                d=0.5*(e=(dx >= 0.0 ? a-x : b-x));
            }
        } else {
            d=0.5*(e=(dx >= 0.0 ? a-x : b-x));
        }
        if (fabs(d) >= tol1) {
            u=x+d;
            fu=(*f)(u);
        } else {
            u=x+SIGN(tol1,d);
            fu=(*f)(u);
            if (fu > fx) {              /* If the minimum step in the downhill
                                           direction takes us uphill, then
                                           we are done. */
                *xmin=x;
                return fx;
            }
        }
        du=(*df)(u);                    /* Now all the housekeeping, sigh. */
        if (fu <= fx) {
            if (u >= x) a=x; else b=x;
            MOV3(v,fv,dv, w,fw,dw)
            MOV3(w,fw,dw, x,fx,dx)
            MOV3(x,fx,dx, u,fu,du)
        } else {
            if (u < x) a=u; else b=u;
            if (fu <= fw || w == x) {
                MOV3(v,fv,dv, w,fw,dw)
                MOV3(w,fw,dw, u,fu,du)
            } else if (fu < fv || v == x || v == w) {
                MOV3(v,fv,dv, u,fu,du)
            }
        }
    }
    nrerror("Too many iterations in routine dbrent");
    return 0.0;                         /* Never get here. */
}


CITED REFERENCES AND FURTHER READING: 

Acton, F.S. 1970, Numerical Methods That Work; 1990, corrected edition (Washington:
Mathematical Association of America), pp. 55; 454-458. [1]

Brent, R.P. 1973, Algorithms for Minimization without Derivatives (Englewood Cliffs, NJ:
Prentice-Hall), p. 78.


10.4 Downhill Simplex Method in 
Multidimensions 

With this section we begin consideration of multidimensional minimization, 
that is, finding the minimum of a function of more than one independent variable. 
This section stands apart from those which follow, however: All of the algorithms 
after this section will make explicit use of a one-dimensional minimization algorithm 
as a part of their computational strategy. This section implements an entirely 
self-contained strategy, in which one-dimensional minimization does not figure. 

The downhill simplex method is due to Nelder and Mead [1]. The method
requires only function evaluations, not derivatives. It is not very efficient in terms 
of the number of function evaluations that it requires. Powell’s method (§10.5) is 
almost surely faster in all likely applications. However, the downhill simplex method 
may frequently be the best method to use if the figure of merit is “get something 
working quickly” for a problem whose computational burden is small. 

The method has a geometrical naturalness about it which makes it delightful 
to describe or work through: 

A simplex is the geometrical figure consisting, in N dimensions, of N + 1
points (or vertices) and all their interconnecting line segments, polygonal faces, etc.
In two dimensions, a simplex is a triangle. In three dimensions it is a tetrahedron,
not necessarily the regular tetrahedron. (The simplex method of linear programming,
described in §10.8, also makes use of the geometrical concept of a simplex. Otherwise
it is completely unrelated to the algorithm that we are describing in this section.) In
general we are only interested in simplexes that are nondegenerate, i.e., that enclose
a finite inner N-dimensional volume. If any point of a nondegenerate simplex is
taken as the origin, then the N other points define vector directions that span the
N-dimensional vector space.

Figure 10.4.1. Possible outcomes for a step in the downhill simplex method. The simplex at the
beginning of the step, here a tetrahedron, is shown, top. The simplex at the end of the step can be any one
of (a) a reflection away from the high point, (b) a reflection and expansion away from the high point, (c)
a contraction along one dimension from the high point, or (d) a contraction along all dimensions towards
the low point. An appropriate sequence of such steps will always converge to a minimum of the function.

In one-dimensional minimization, it was possible to bracket a minimum, so that 
the success of a subsequent isolation was guaranteed. Alas! There is no analogous 
procedure in multidimensional space. For multidimensional minimization, the best 
we can do is give our algorithm a starting guess, that is, an N-vector of independent
variables as the first point to try. The algorithm is then supposed to make its own way
downhill through the unimaginable complexity of an N-dimensional topography,
until it encounters a (local, at least) minimum. 

The downhill simplex method must be started not just with a single point, but 
with N + 1 points, defining an initial simplex. If you think of one of these points 
(it matters not which) as being your initial starting point $\mathbf{P}_0$, then you can take
the other N points to be

$$\mathbf{P}_i = \mathbf{P}_0 + \lambda\,\mathbf{e}_i \tag{10.4.1}$$

where the $\mathbf{e}_i$’s are N unit vectors, and where $\lambda$ is a constant which is your guess
of the problem’s characteristic length scale. (Or, you could have different $\lambda_i$’s for
each vector direction.)
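
Equation (10.4.1) translates directly into a double loop. The sketch below is ours, not one of the book's routines: the name init_simplex, the zero-based indexing, and the flat row-major storage are all assumptions made to keep the example self-contained (the routine amoeba later in this section instead expects a unit-offset matrix p[1..ndim+1][1..ndim]).

```c
#include <assert.h>
#include <stddef.h>

/* Fill p, an (ndim+1) x ndim array stored row by row, with the vertices
   of a starting simplex: row 0 is P_0, and row i (i = 1..ndim) is
   P_0 + lambda * e_i, where e_i is the i-th unit vector (equation 10.4.1).
   Hypothetical helper for illustration only. */
void init_simplex(double *p, const double *p0, size_t ndim, double lambda)
{
    size_t i, j;

    assert(p != NULL && p0 != NULL && ndim > 0);
    for (i = 0; i <= ndim; i++)
        for (j = 0; j < ndim; j++)
            p[i * ndim + j] = p0[j] + ((i > 0 && j == i - 1) ? lambda : 0.0);
}
```

To use a different length scale in each direction, you would replace the scalar lambda by an array indexed by j.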

The downhill simplex method now takes a series of steps, most steps just moving 
the point of the simplex where the function is largest (“highest point”) through the 
opposite face of the simplex to a lower point. These steps are called reflections, 
and they are constructed to conserve the volume of the simplex (hence maintain 
its nondegeneracy). When it can do so, the method expands the simplex in one or 
another direction to take larger steps. When it reaches a “valley floor,” the method 
contracts itself in the transverse direction and tries to ooze down the valley. If there 
is a situation where the simplex is trying to “pass through the eye of a needle,” it 
contracts itself in all directions, pulling itself in around its lowest (best) point. The 
routine name amoeba is intended to be descriptive of this kind of behavior; the basic 
moves are summarized in Figure 10.4.1. 
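
All three single-vertex moves are the same operation with a different scale factor: the high point is carried along the line through the centroid of the opposite face, to centroid + fac × (high point − centroid). The sketch below shows that arithmetic in the form used by the routine amotry later in this section; the function name extrapolate_vertex, the zero-based indexing, and the flat row-major storage are our assumptions. With fac = −1 the result is a reflection, with fac = 0.5 a one-dimensional contraction, and with fac = 2 an expansion away from the face.

```c
#include <assert.h>
#include <stddef.h>

/* Compute the trial point ptry = c + fac * (p_hi - c), where p_hi is
   vertex number ihi of the simplex and c is the centroid of the other
   ndim vertices. p is (ndim+1) x ndim, stored row by row; psum[j] is the
   sum of coordinate j over all ndim+1 vertices.
   Hypothetical helper for illustration only. */
void extrapolate_vertex(const double *p, const double *psum, size_t ndim,
                        size_t ihi, double fac, double *ptry)
{
    double fac1, fac2;
    size_t j;

    assert(p != NULL && psum != NULL && ptry != NULL);
    assert(ndim > 0 && ihi <= ndim);
    fac1 = (1.0 - fac) / (double)ndim;
    fac2 = fac1 - fac;
    for (j = 0; j < ndim; j++)
        ptry[j] = psum[j] * fac1 - p[ihi * ndim + j] * fac2;
}
```

Working with the running sum psum rather than the centroid itself is what lets the simplex be updated after each accepted move without revisiting every vertex.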

Termination criteria can be delicate in any multidimensional minimization 
routine. Without bracketing, and with more than one independent variable, we 
no longer have the option of requiring a certain tolerance for a single independent 
variable. We typically can identify one “cycle” or “step” of our multidimensional 
algorithm. It is then possible to terminate when the vector distance moved in that 
step is fractionally smaller in magnitude than some tolerance tol. Alternatively, 
we could require that the decrease in the function value in the terminating step be 
fractionally smaller than some tolerance ftol. Note that while tol should not 
usually be smaller than the square root of the machine precision, it is perfectly 
appropriate to let ftol be of order the machine precision (or perhaps slightly larger 
so as not to be diddled by roundoff). 

Note well that either of the above criteria might be fooled by a single anomalous 
step that, for one reason or another, failed to get anywhere. Therefore, it is frequently 
a good idea to restart a multidimensional minimization routine at a point where 
it claims to have found a minimum. For this restart, you should reinitialize any 
ancillary input quantities. In the downhill simplex method, for example, you should 
reinitialize N of the N + 1 vertices of the simplex again by equation (10.4.1), with 
$\mathbf{P}_0$ being one of the vertices of the claimed minimum.

Restarts should never be very expensive; your algorithm did, after all, converge 
to the restart point once, and now you are starting the algorithm already there. 

Consider, then, our N-dimensional amoeba:

#include <math.h>
#include "nrutil.h"

#define TINY 1.0e-10                    /* A small number. */
#define NMAX 5000                       /* Maximum allowed number of function
                                           evaluations. */
#define GET_PSUM \
    for (j=1;j<=ndim;j++) {\
        for (sum=0.0,i=1;i<=mpts;i++) sum += p[i][j];\
        psum[j]=sum;}
#define SWAP(a,b) {swap=(a);(a)=(b);(b)=swap;}

void amoeba(float **p, float y[], int ndim, float ftol,
    float (*funk)(float []), int *nfunk)
/* Multidimensional minimization of the function funk(x) where x[1..ndim] is a vector in ndim
dimensions, by the downhill simplex method of Nelder and Mead. The matrix p[1..ndim+1]
[1..ndim] is input. Its ndim+1 rows are ndim-dimensional vectors which are the vertices of
the starting simplex. Also input is the vector y[1..ndim+1], whose components must be
pre-initialized to the values of funk evaluated at the ndim+1 vertices (rows) of p; and ftol the
fractional convergence tolerance to be achieved in the function value (n.b.!). On output, p and
y will have been reset to ndim+1 new points all within ftol of a minimum function value, and
nfunk gives the number of function evaluations taken. */
{
    float amotry(float **p, float y[], float psum[], int ndim,
        float (*funk)(float []), int ihi, float fac);
    int i,ihi,ilo,inhi,j,mpts=ndim+1;
    float rtol,sum,swap,ysave,ytry,*psum;

    psum=vector(1,ndim);
    *nfunk=0;
    GET_PSUM
    for (;;) {
        ilo=1;
        /* First we must determine which point is the highest (worst), next-highest, and lowest
        (best), by looping over the points in the simplex. */
        ihi = y[1]>y[2] ? (inhi=2,1) : (inhi=1,2);
        for (i=1;i<=mpts;i++) {
            if (y[i] <= y[ilo]) ilo=i;
            if (y[i] > y[ihi]) {
                inhi=ihi;
                ihi=i;
            } else if (y[i] > y[inhi] && i != ihi) inhi=i;
        }
        rtol=2.0*fabs(y[ihi]-y[ilo])/(fabs(y[ihi])+fabs(y[ilo])+TINY);
        /* Compute the fractional range from highest to lowest and return if satisfactory. */
        if (rtol < ftol) {              /* If returning, put best point and value in slot 1. */
            SWAP(y[1],y[ilo])
            for (i=1;i<=ndim;i++) SWAP(p[1][i],p[ilo][i])
            break;
        }
        if (*nfunk >= NMAX) nrerror("NMAX exceeded");
        *nfunk += 2;
        /* Begin a new iteration. First extrapolate by a factor -1 through the face of the simplex
        across from the high point, i.e., reflect the simplex from the high point. */
        ytry=amotry(p,y,psum,ndim,funk,ihi,-1.0);
        if (ytry <= y[ilo])
            /* Gives a result better than the best point, so try an additional extrapolation by a
            factor 2. */
            ytry=amotry(p,y,psum,ndim,funk,ihi,2.0);
        else if (ytry >= y[inhi]) {
            /* The reflected point is worse than the second-highest, so look for an intermediate
            lower point, i.e., do a one-dimensional contraction. */
            ysave=y[ihi];
            ytry=amotry(p,y,psum,ndim,funk,ihi,0.5);
            if (ytry >= ysave) {        /* Can't seem to get rid of that high point. Better
                                           contract around the lowest (best) point. */
                for (i=1;i<=mpts;i++) {
                    if (i != ilo) {
                        for (j=1;j<=ndim;j++)
                            p[i][j]=psum[j]=0.5*(p[i][j]+p[ilo][j]);
                        y[i]=(*funk)(psum);
                    }
                }
                *nfunk += ndim;         /* Keep track of function evaluations. */
                GET_PSUM                /* Recompute psum. */
            }
        } else --(*nfunk);              /* Correct the evaluation count. */
    }                                   /* Go back for the test of doneness and the next
                                           iteration. */
    free_vector(psum,1,ndim);
}

#include "nrutil.h"

float amotry(float **p, float y[], float psum[], int ndim,
    float (*funk)(float []), int ihi, float fac)
/* Extrapolates by a factor fac through the face of the simplex across from the high point, tries
it, and replaces the high point if the new point is better. */
{
    int j;
    float fac1,fac2,ytry,*ptry;

    ptry=vector(1,ndim);
    fac1=(1.0-fac)/ndim;
    fac2=fac1-fac;
    for (j=1;j<=ndim;j++) ptry[j]=psum[j]*fac1-p[ihi][j]*fac2;
    ytry=(*funk)(ptry);                 /* Evaluate the function at the trial point. */
    if (ytry < y[ihi]) {                /* If it's better than the highest, then replace
                                           the highest. */
        y[ihi]=ytry;
        for (j=1;j<=ndim;j++) {
            psum[j] += ptry[j]-p[ihi][j];
            p[ihi][j]=ptry[j];
        }
    }
    free_vector(ptry,1,ndim);
    return ytry;
}

CITED REFERENCES AND FURTHER READING: 

Nelder, J.A., and Mead, R. 1965, Computer Journal, vol. 7, pp. 308-313. [1] 

Yarbro, L.A., and Deming, S.N. 1974, Analytica Chimica Acta, vol. 73, pp. 391-398.

Jacoby, S.L.S., Kowalik, J.S., and Pizzo, J.T. 1972, Iterative Methods for Nonlinear Optimization
Problems (Englewood Cliffs, NJ: Prentice-Hall). 



10.5 Direction Set (Powell’s) Methods in 
Multidimensions 




We know (§10.1—§10.3) how to minimize a function of one variable. If we 
start at a point P in N-dimensional space, and proceed from there in some vector
direction n, then any function of N variables f(P) can be minimized along the line
n by our one-dimensional methods. One can dream up various multidimensional 
minimization methods that consist of sequences of such line minimizations. Different 
methods will differ only by how, at each stage, they choose the next direction n to 
try. All such methods presume the existence of a “black-box” sub-algorithm, which 
we might call linmin (given as an explicit routine at the end of this section), whose 
definition can be taken for now as 


linmin: Given as input the vectors P and n, and the
function f, find the scalar $\lambda$ that minimizes $f(\mathbf{P} + \lambda\mathbf{n})$.
Replace P by $\mathbf{P} + \lambda\mathbf{n}$. Replace n by $\lambda\mathbf{n}$. Done.


All the minimization methods in this section and in the two sections following 
fall under this general schema of successive line minimizations. (The algorithm 
in §10.7 does not need very accurate line minimizations. Accordingly, it has its 
own approximate line minimization routine, lnsrch.) In this section we consider 
a class of methods whose choice of successive directions does not involve explicit 
computation of the function’s gradient; the next two sections do require such gradient 
calculations. You will note that we need not specify whether linmin uses gradient 
information or not. That choice is up to you, and its optimization depends on your 
particular function. You would be crazy, however, to use gradients in linmin and 
not use them in the choice of directions, since in this latter role they can drastically 
reduce the total computational burden. 

But what if, in your application, calculation of the gradient is out of the question?
You might first think of this simple method: Take the unit vectors $\mathbf{e}_1, \mathbf{e}_2, \ldots, \mathbf{e}_N$ as a
set of directions. Using linmin, move along the first direction to its minimum, then
from there along the second direction to its minimum, and so on, cycling through the 
whole set of directions as many times as necessary, until the function stops decreasing. 
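
To see the simple method in action, here is a sketch for a two-dimensional quadratic f(x) = ½ x·A·x − b·x, where the line minimum along a coordinate direction can be written in closed form. That closed-form step stands in for linmin; it, the names coord_linmin and cycle_coordinates, and the stopping rule (largest coordinate move in a cycle below tol) are our assumptions for the example.

```c
#include <assert.h>
#include <math.h>

#define NDIM 2

/* Exact line minimization of f(x) = 0.5 x.A.x - b.x along coordinate k.
   A is NDIM x NDIM, symmetric, stored row by row. Moves x[k] to the
   minimum and returns the signed distance moved.
   Hypothetical stand-in for linmin, valid only for quadratic f. */
static double coord_linmin(const double *A, const double *b, double *x, int k)
{
    double g = -b[k];                   /* k-th gradient component, (A.x - b)_k */
    double lambda;
    int j;

    for (j = 0; j < NDIM; j++) g += A[k * NDIM + j] * x[j];
    lambda = -g / A[k * NDIM + k];
    x[k] += lambda;
    return lambda;
}

/* Cycle through the coordinate directions until the largest move in a
   full cycle is below tol. Returns the number of cycles performed. */
int cycle_coordinates(const double *A, const double *b, double *x,
                      double tol, int max_cycles)
{
    int cycle, k;

    assert(tol > 0.0 && max_cycles > 0);
    for (cycle = 1; cycle <= max_cycles; cycle++) {
        double biggest = 0.0;
        for (k = 0; k < NDIM; k++) {
            double step = fabs(coord_linmin(A, b, x, k));
            if (step > biggest) biggest = step;
        }
        if (biggest < tol) return cycle;
    }
    return max_cycles;
}
```

Try it on a matrix with eigenvalues 1 and 100: when the eigenvectors lie along the coordinate axes the method is done almost at once, but rotate the same valley by 45 degrees and it needs hundreds of cycles.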

This simple method is actually not too bad for many functions. Even more 
interesting is why it is bad, i.e. very inefficient, for some other functions. Consider 
a function of two dimensions whose contour map (level lines) happens to define a 
long, narrow valley at some angle to the coordinate basis vectors (see Figure 10.5.1). 
Then the only way “down the length of the valley” going along the basis vectors at 
each stage is by a series of many tiny steps. More generally, in N dimensions, if 
the function’s second derivatives are much larger in magnitude in some directions 
than in others, then many cycles through all N basis vectors will be required in 
order to get anywhere. This condition is not all that unusual; according to Murphy’s 
Law, you should count on it. 

Obviously what we need is a better set of directions than the $\mathbf{e}_i$’s. All direction
set methods consist of prescriptions for updating the set of directions as the method 
proceeds, attempting to come up with a set which either (i) includes some very 
good directions that will take us far along narrow valleys, or else (more subtly) 
(ii) includes some number of “non-interfering” directions with the special property 
that minimization along one is not “spoiled” by subsequent minimization along 
another, so that interminable cycling through the set of directions can be avoided.

Figure 10.5.1. Successive minimizations along coordinate directions in a long, narrow “valley” (shown
as contour lines). Unless the valley is optimally oriented, this method is extremely inefficient, taking 
many tiny steps to get to the minimum, crossing and re-crossing the principal axis. 

Conjugate Directions 


This concept of “non-interfering” directions, more conventionally called conjugate
directions, is worth making mathematically explicit.

First, note that if we minimize a function along some direction u, then the 
gradient of the function must be perpendicular to u at the line minimum; if not, then 
there would still be a nonzero directional derivative along u. 

Next take some particular point P as the origin of the coordinate system with 
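coordinates x. Then any function f can be approximated by its Taylor series

$$f(\mathbf{x}) = f(\mathbf{P}) + \sum_i \frac{\partial f}{\partial x_i}\,x_i + \frac{1}{2}\sum_{i,j} \frac{\partial^2 f}{\partial x_i\,\partial x_j}\,x_i x_j + \cdots \;\approx\; c - \mathbf{b}\cdot\mathbf{x} + \frac{1}{2}\,\mathbf{x}\cdot\mathbf{A}\cdot\mathbf{x} \tag{10.5.1}$$

where

$$c \equiv f(\mathbf{P}) \qquad \mathbf{b} \equiv -\nabla f\big|_{\mathbf{P}} \qquad [\mathbf{A}]_{ij} \equiv \left.\frac{\partial^2 f}{\partial x_i\,\partial x_j}\right|_{\mathbf{P}} \tag{10.5.2}$$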

The matrix A whose components are the second partial derivative matrix of the
function is called the Hessian matrix of the function at P.

In the approximation of (10.5.1), the gradient of f is easily calculated as

$$\nabla f = \mathbf{A}\cdot\mathbf{x} - \mathbf{b} \tag{10.5.3}$$

(This implies that the gradient will vanish, and the function will be at an extremum,
at a value of x obtained by solving A · x = b. This idea we will return to in §10.7!)

How does the gradient ∇f change as we move along some direction? Evidently

$$\delta(\nabla f) = \mathbf{A}\cdot(\delta\mathbf{x}) \tag{10.5.4}$$

Suppose that we have moved along some direction u to a minimum and now
propose to move along some new direction v. The condition that motion along v not
spoil our minimization along u is just that the gradient stay perpendicular to u, i.e.,
that the change in the gradient be perpendicular to u. By equation (10.5.4) this is just

$$0 = \mathbf{u}\cdot\delta(\nabla f) = \mathbf{u}\cdot\mathbf{A}\cdot\mathbf{v} \tag{10.5.5}$$

When (10.5.5) holds for two vectors u and v, they are said to be conjugate. 
When the relation holds pairwise for all members of a set of vectors, they are said 
to be a conjugate set. If you do successive line minimization of a function along 
a conjugate set of directions, then you don’t need to redo any of those directions 
(unless, of course, you spoil things by minimizing along a direction that they are 
not conjugate to). 
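
The conjugacy condition (10.5.5) is a one-line computation once the Hessian is in hand. The sketch below, with names of our own choosing (bilinear, make_conjugate), evaluates u·A·v and also shows the standard way to bend an arbitrary direction w into one that is conjugate to u, by subtracting off the right multiple of u. It assumes that A is symmetric (as any Hessian is) and is stored row by row.

```c
#include <assert.h>
#include <math.h>

/* The bilinear form u.A.v for an n x n matrix A stored row by row.
   Two directions u and v are conjugate when this vanishes.
   Hypothetical helper for illustration only. */
double bilinear(const double *A, const double *u, const double *v, int n)
{
    double s = 0.0;
    int i, j;

    assert(n > 0);
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            s += u[i] * A[i * n + j] * v[j];
    return s;
}

/* Replace w by w - (u.A.w / u.A.u) u, which is conjugate to u
   when A is symmetric. */
void make_conjugate(const double *A, const double *u, double *w, int n)
{
    double uAu = bilinear(A, u, u, n);
    double c;
    int i;

    assert(uAu != 0.0);
    c = bilinear(A, u, w, n) / uAu;
    for (i = 0; i < n; i++) w[i] -= c * u[i];
}
```

Of course, the whole point of a direction set method is that the Hessian is not available, so conjugate directions must be built up from line minimizations alone; the sketch is useful mainly for checking such a method on test functions whose Hessian you do know.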

A triumph for a direction set method is to come up with a set of N linearly 
independent, mutually conjugate directions. Then, one pass of N line minimizations 
will put it exactly at the minimum of a quadratic form like (10.5.1). For functions 
f that are not exactly quadratic forms, it won’t be exactly at the minimum; but
repeated cycles of N line minimizations will in due course converge quadratically 
to the minimum. 

Powell’s Quadratically Convergent Method 

Powell first discovered a direction set method that does produce N mutually 
conjugate directions. Here is how it goes: Initialize the set of directions $\mathbf{u}_i$ to 
the basis vectors, 

$$ \mathbf{u}_i = \mathbf{e}_i \qquad i = 1, \ldots, N \tag{10.5.6} $$

Now repeat the following sequence of steps (“basic procedure”) until your function 
stops decreasing: 

• Save your starting position as $\mathbf{P}_0$. 

• For $i = 1, \ldots, N$, move $\mathbf{P}_{i-1}$ to the minimum along direction $\mathbf{u}_i$ and 
call this point $\mathbf{P}_i$. 

• For $i = 1, \ldots, N-1$, set $\mathbf{u}_i \leftarrow \mathbf{u}_{i+1}$. 

• Set $\mathbf{u}_N \leftarrow \mathbf{P}_N - \mathbf{P}_0$. 

• Move $\mathbf{P}_N$ to the minimum along direction $\mathbf{u}_N$ and call this point $\mathbf{P}_0$. 

Powell, in 1964, showed that, for a quadratic form like (10.5.1), k iterations 
of the above basic procedure produce a set of directions $\mathbf{u}_i$ whose last k members 
are mutually conjugate. Therefore, N iterations of the basic procedure, amounting 
to N(N + 1) line minimizations in all, will exactly minimize a quadratic form. 
Brent [1] gives proofs of these statements in accessible form. 
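
The basic procedure can be sketched for the special case of an exactly quadratic function, where each line minimization has a closed form. This is only an illustration under stated assumptions: the quadratic form (A and b), the dimension, and the names `line_min_quadratic`, `powell_basic_iteration`, and `powell_basic_demo` are hypothetical, and a real application would use a general line minimizer instead.

```c
#define NDIM 2

/* Illustrative quadratic form f(x) = c - b.x + (1/2) x.A.x (values assumed). */
static const double A[NDIM][NDIM] = { {2.0, 1.0}, {1.0, 3.0} };
static const double B[NDIM] = { 1.0, 2.0 };

/* Exact line minimum of the quadratic form along d, starting from x. */
static void line_min_quadratic(double x[NDIM], const double d[NDIM])
{
    double dg = 0.0, dAd = 0.0;
    int i, j;

    for (i = 0; i < NDIM; i++) {
        double gi = -B[i], Adi = 0.0;
        for (j = 0; j < NDIM; j++) {
            gi += A[i][j] * x[j];      /* gradient component (10.5.3) */
            Adi += A[i][j] * d[j];
        }
        dg += d[i] * gi;
        dAd += d[i] * Adi;
    }
    if (dAd == 0.0) return;            /* zero direction: nothing to do */
    for (i = 0; i < NDIM; i++) x[i] -= (dg / dAd) * d[i];
}

/* One pass of the basic procedure; u[i] holds the i-th direction. */
void powell_basic_iteration(double p[NDIM], double u[NDIM][NDIM])
{
    double p0[NDIM];
    int i, k;

    for (k = 0; k < NDIM; k++) p0[k] = p[k];                   /* save P_0 */
    for (i = 0; i < NDIM; i++) line_min_quadratic(p, u[i]);    /* P_{i-1} -> P_i */
    for (i = 0; i < NDIM - 1; i++)                             /* u_i <- u_{i+1} */
        for (k = 0; k < NDIM; k++) u[i][k] = u[i + 1][k];
    for (k = 0; k < NDIM; k++) u[NDIM - 1][k] = p[k] - p0[k];  /* u_N <- P_N - P_0 */
    line_min_quadratic(p, u[NDIM - 1]);                        /* new P_0 */
}

/* N iterations starting from the origin with the basis vectors. */
void powell_basic_demo(double p[NDIM])
{
    double u[NDIM][NDIM] = { {1.0, 0.0}, {0.0, 1.0} };
    int it;

    p[0] = p[1] = 0.0;
    for (it = 0; it < NDIM; it++) powell_basic_iteration(p, u);
}
```

For these assumed values the N = 2 iterations, six line minimizations in all, should land on the solution of A · x = b, namely (0.2, 0.6), up to roundoff.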

Unfortunately, there is a problem with Powell’s quadratically convergent algorithm. 
The procedure of throwing away, at each stage, $\mathbf{u}_1$ in favor of $\mathbf{P}_N - \mathbf{P}_0$ 
tends to produce sets of directions that “fold up on each other” and become linearly 
dependent. Once this happens, then the procedure finds the minimum of the function 
f only over a subspace of the full N-dimensional case; in other words, it gives the 
wrong answer. Therefore, the algorithm must not be used in the form given above. 

There are a number of ways to fix up the problem of linear dependence in 
Powell’s algorithm, among them: 

1. You can reinitialize the set of directions $\mathbf{u}_i$ to the basis vectors $\mathbf{e}_i$ after every 
N or N + 1 iterations of the basic procedure. This produces a serviceable method, 
which we commend to you if quadratic convergence is important for your application 
(i.e., if your functions are close to quadratic forms and if you desire high accuracy). 

2. Brent points out that the set of directions can equally well be reset to 
the columns of any orthogonal matrix. Rather than throw away the information 
on conjugate directions already built up, he resets the direction set to calculated 
principal directions of the matrix A (which he gives a procedure for determining). 
The calculation is essentially a singular value decomposition algorithm (see §2.6). 
Brent has a number of other cute tricks up his sleeve, and his modification of 
Powell’s method is probably the best presently known. Consult [1] for a detailed 
description and listing of the program. Unfortunately it is rather too elaborate for 
us to include here. 

3. You can give up the property of quadratic convergence in favor of a more 
heuristic scheme (due to Powell) which tries to find a few good directions along 
narrow valleys instead of N necessarily conjugate directions. This is the method 
that we now implement. (It is also the version of Powell’s method given in Acton [2], 
from which parts of the following discussion are drawn.) 

Discarding the Direction of Largest Decrease 

The fox and the grapes: Now that we are going to give up the property of 
quadratic convergence, was it so important after all? That depends on the function 
that you are minimizing. Some applications produce functions with long, twisty 
valleys. Quadratic convergence is of no particular advantage to a program which 
must slalom down the length of a valley floor that twists one way and another (and 
another, and another, ... - there are N dimensions!). Along the long direction, 
a quadratically convergent method is trying to extrapolate to the minimum of a 
parabola which just isn’t (yet) there; while the conjugacy of the N − 1 transverse 
directions keeps getting spoiled by the twists. 

Sooner or later, however, we do arrive at an approximately ellipsoidal minimum 
(cf. equation 10.5.1 when b, the gradient, is zero). Then, depending on how much 
accuracy we require, a method with quadratic convergence can save us several times 
$N^2$ extra line minimizations, since quadratic convergence doubles the number of 
significant figures at each iteration. 

The basic idea of our now-modified Powell’s method is still to take $\mathbf{P}_N - \mathbf{P}_0$ as 
a new direction; it is, after all, the average direction moved after trying all N possible 
directions. For a valley whose long direction is twisting slowly, this direction is 
likely to give us a good run along the new long direction. The change is to discard 
the old direction along which the function f made its largest decrease. This seems 
paradoxical, since that direction was the best of the previous iteration. However, it 
is also likely to be a major component of the new direction that we are adding, so 
dropping it gives us the best chance of avoiding a buildup of linear dependence. 

There are a couple of exceptions to this basic idea. Sometimes it is better not 
to add a new direction at all. Define 


$$ f_0 \equiv f(\mathbf{P}_0) \qquad f_N \equiv f(\mathbf{P}_N) \qquad f_E \equiv f(2\mathbf{P}_N - \mathbf{P}_0) \tag{10.5.7} $$

Here $f_E$ is the function value at an “extrapolated” point somewhat further along 
the proposed new direction. Also define $\Delta f$ to be the magnitude of the largest 
decrease along one particular direction of the present basic procedure iteration. ($\Delta f$ 
is a positive number.) Then: 

1. If $f_E \ge f_0$, then keep the old set of directions for the next basic procedure, 
because the average direction $\mathbf{P}_N - \mathbf{P}_0$ is all played out. 

2. If $2(f_0 - 2f_N + f_E)[(f_0 - f_N) - \Delta f]^2 \ge (f_0 - f_E)^2 \Delta f$, then keep the old 
set of directions for the next basic procedure, because either (i) the decrease along 
the average direction was not primarily due to any single direction’s decrease, or (ii) 
there is a substantial second derivative along the average direction and we seem to 
be near to the bottom of its minimum. 
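
The two tests above can be collected into one small predicate. The following is a sketch only; the function name `adopt_new_direction` and its argument names are hypothetical, and the inequalities are taken directly from conditions 1 and 2.

```c
/* Decide whether the average direction P_N - P_0 should replace the
   direction of largest decrease.
   f0, fN, fE are the function values of (10.5.7); delf is the largest
   single-direction decrease (a positive number).
   Returns 1 to adopt the new direction, 0 to keep the old set. */
int adopt_new_direction(double f0, double fN, double fE, double delf)
{
    double a, b, t;

    if (fE >= f0) return 0;                /* condition 1: keep the old set */
    a = (f0 - fN) - delf;
    b = f0 - fE;
    t = 2.0 * (f0 - 2.0 * fN + fE) * a * a - delf * b * b;
    return (t < 0.0) ? 1 : 0;              /* condition 2 keeps the old set when t >= 0 */
}
```

For example, with f0 = 10, fN = 4, fE = 6, a largest single decrease of 5 gives t = −64, so the new direction is adopted, while a largest single decrease of 1 gives t = 384 and the old set is kept.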

The following routine implements Powell’s method in the version just described. 
In the routine, xi is the matrix whose columns are the set of directions $\mathbf{u}_i$; otherwise 
the correspondence of notation should be self-evident. 

#include <math.h> 
#include "nrutil.h" 
#define TINY 1.0e-25                /* A small number. */ 
#define ITMAX 200                   /* Maximum allowed iterations. */ 

void powell(float p[], float **xi, int n, float ftol, int *iter, float *fret, 
    float (*func)(float [])) 
/* Minimization of a function func of n variables. Input consists of an initial starting point 
p[1..n]; an initial matrix xi[1..n][1..n], whose columns contain the initial set of 
directions (usually the n unit vectors); and ftol, the fractional tolerance in the function value 
such that failure to decrease by more than this amount on one iteration signals doneness. On 
output, p is set to the best point found, xi is the then-current direction set, fret is the returned 
function value at p, and iter is the number of iterations taken. The routine linmin is used. */ 
{ 
    void linmin(float p[], float xi[], int n, float *fret, 
        float (*func)(float [])); 
    int i,ibig,j; 
    float del,fp,fptt,t,*pt,*ptt,*xit; 

    pt=vector(1,n); 
    ptt=vector(1,n); 
    xit=vector(1,n); 
    *fret=(*func)(p); 
    for (j=1;j<=n;j++) pt[j]=p[j];          /* Save the initial point. */ 
    for (*iter=1;;++(*iter)) { 
        fp=(*fret); 
        ibig=0; 
        del=0.0;                            /* Will be the biggest function decrease. */ 
        for (i=1;i<=n;i++) {                /* In each iteration, loop over all directions in the set. */ 
            for (j=1;j<=n;j++) xit[j]=xi[j][i];   /* Copy the direction, */ 
            fptt=(*fret); 
            linmin(p,xit,n,fret,func);      /* minimize along it, */ 
            if (fptt-(*fret) > del) {       /* and record it if it is the largest decrease so far. */ 
                del=fptt-(*fret); 
                ibig=i; 
            } 
        } 
        if (2.0*(fp-(*fret)) <= ftol*(fabs(fp)+fabs(*fret))+TINY) { 
            free_vector(xit,1,n);           /* Termination criterion. */ 
            free_vector(ptt,1,n); 
            free_vector(pt,1,n); 
            return; 
        } 
        if (*iter == ITMAX) nrerror("powell exceeding maximum iterations."); 
        for (j=1;j<=n;j++) {                /* Construct the extrapolated point and the */ 
            ptt[j]=2.0*p[j]-pt[j];          /* average direction moved. Save the */ 
            xit[j]=p[j]-pt[j];              /* old starting point. */ 
            pt[j]=p[j]; 
        } 
        fptt=(*func)(ptt);                  /* Function value at extrapolated point. */ 
        if (fptt < fp) { 
            t=2.0*(fp-2.0*(*fret)+fptt)*SQR(fp-(*fret)-del)-del*SQR(fp-fptt); 
            if (t < 0.0) { 
                linmin(p,xit,n,fret,func);  /* Move to the minimum of the new direction, */ 
                for (j=1;j<=n;j++) {        /* and save the new direction. */ 
                    xi[j][ibig]=xi[j][n]; 
                    xi[j][n]=xit[j]; 
                } 
            } 
        } 
    }                                       /* Back for another iteration. */ 
} 



Implementation of Line Minimization 

Make no mistake, there is a right way to implement linmin: It is to use 
the methods of one-dimensional minimization described in §10.1—§10.3, but to 
rewrite the programs of those sections so that their bookkeeping is done on 
vector-valued points P (all lying along a given direction n) rather than scalar-valued 
abscissas x. That straightforward task produces long routines densely populated 
with “for(k=1;k<=n;k++)” loops. 

We do not have space to include such routines in this book. Our linmin, which 
works just fine, is instead a kind of bookkeeping swindle. It constructs an “artificial” 
function of one variable called f1dim, which is the value of your function, say, 
func, along the line going through the point p in the direction xi. linmin calls our 
familiar one-dimensional routines mnbrak (§10.1) and brent (§10.3) and instructs 
them to minimize f1dim. linmin communicates with f1dim “over the head” of 
mnbrak and brent, through global (external) variables. That is also how it passes 
to f1dim a pointer to your user-supplied function. 

The only thing inefficient about linmin is this: Its use as an interface between a 
multidimensional minimization strategy and a one-dimensional minimization routine 
results in some unnecessary copying of vectors hither and yon. That should not 
normally be a significant addition to the overall computational burden, but we cannot 
disguise its inelegance. 


#include "nrutil.h" 
#define TOL 2.0e-4                  /* Tolerance passed to brent. */ 

int ncom;                           /* Global variables communicate with f1dim. */ 
float *pcom,*xicom,(*nrfunc)(float []); 

void linmin(float p[], float xi[], int n, float *fret, float (*func)(float [])) 
/* Given an n-dimensional point p[1..n] and an n-dimensional direction xi[1..n], moves and 
resets p to where the function func(p) takes on a minimum along the direction xi from p, 
and replaces xi by the actual vector displacement that p was moved. Also returns as fret 
the value of func at the returned location p. This is actually all accomplished by calling the 
routines mnbrak and brent. */ 
{ 
    float brent(float ax, float bx, float cx, 
        float (*f)(float), float tol, float *xmin); 
    float f1dim(float x); 
    void mnbrak(float *ax, float *bx, float *cx, float *fa, float *fb, 
        float *fc, float (*func)(float)); 
    int j; 
    float xx,xmin,fx,fb,fa,bx,ax; 

    ncom=n;                         /* Define the global variables. */ 
    pcom=vector(1,n); 
    xicom=vector(1,n); 
    nrfunc=func; 
    for (j=1;j<=n;j++) { 
        pcom[j]=p[j]; 
        xicom[j]=xi[j]; 
    } 
    ax=0.0;                         /* Initial guess for brackets. */ 
    xx=1.0; 
    mnbrak(&ax,&xx,&bx,&fa,&fx,&fb,f1dim); 
    *fret=brent(ax,xx,bx,f1dim,TOL,&xmin); 
    for (j=1;j<=n;j++) {            /* Construct the vector results to return. */ 
        xi[j] *= xmin; 
        p[j] += xi[j]; 
    } 
    free_vector(xicom,1,n); 
    free_vector(pcom,1,n); 
} 


#include "nrutil.h" 

extern int ncom;                    /* Defined in linmin. */ 
extern float *pcom,*xicom,(*nrfunc)(float []); 

float f1dim(float x) 
/* Must accompany linmin. */ 
{ 
    int j; 
    float f,*xt; 

    xt=vector(1,ncom); 
    for (j=1;j<=ncom;j++) xt[j]=pcom[j]+x*xicom[j]; 
    f=(*nrfunc)(xt); 
    free_vector(xt,1,ncom); 
    return f; 
} 

CITED REFERENCES AND FURTHER READING: 

Brent, R.P. 1973, Algorithms for Minimization without Derivatives (Englewood Cliffs, NJ: 
Prentice-Hall), Chapter 7. [1] 

Acton, F.S. 1970, Numerical Methods That Work; 1990, corrected edition (Washington: 
Mathematical Association of America), pp. 464-467. [2] 

Jacobs, D.A.H. (ed.) 1977, The State of the Art in Numerical Analysis (London: Academic 
Press), pp. 259-262. 


10.6 Conjugate Gradient Methods in Multidimensions 


We consider now the case where you are able to calculate, at a given N-dimensional 
point P, not just the value of a function f(P) but also the gradient 
(vector of first partial derivatives) $\nabla f(\mathbf{P})$. 

A rough counting argument will show how advantageous it is to use the gradient 
information: Suppose that the function f is roughly approximated as a quadratic 
form, as above in equation (10.5.1), 

$$ f(\mathbf{x}) \approx c - \mathbf{b} \cdot \mathbf{x} + \frac{1}{2}\, \mathbf{x} \cdot \mathbf{A} \cdot \mathbf{x} \tag{10.6.1} $$

Then the number of unknown parameters in f is equal to the number of free 
parameters in A and b, which is $\frac{1}{2}N(N + 1)$, which we see to be of order $N^2$. 
Changing any one of these parameters can move the location of the minimum. 
Therefore, we should not expect to be able to find the minimum until we have 
collected an equivalent information content, of order $N^2$ numbers. 

In the direction set methods of §10.5, we collected the necessary information by 
making on the order of $N^2$ separate line minimizations, each requiring “a few” (but 
sometimes a big few!) function evaluations. Now, each evaluation of the gradient 
will bring us N new components of information. If we use them wisely, we should 
need to make only of order N separate line minimizations. That is in fact the case 
for the algorithms in this section and the next. 

A factor of N improvement in computational speed is not necessarily implied. 
As a rough estimate, we might imagine that the calculation of each component of 
the gradient takes about as long as evaluating the function itself. In that case there 
will be of order $N^2$ equivalent function evaluations both with and without gradient 
information. Even if the advantage is not of order N, however, it is nevertheless 
quite substantial: (i) Each calculated component of the gradient will typically save 
not just one function evaluation, but a number of them, equivalent to, say, a whole 
line minimization. (ii) There is often a high degree of redundancy in the formulas 
for the various components of a function’s gradient; when this is so, especially when 
there is also redundancy with the calculation of the function, then the calculation of 
the gradient may cost significantly less than N function evaluations. 



Figure 10.6.1. (a) Steepest descent method in a long, narrow “valley.” While more efficient than the 
strategy of Figure 10.5.1, steepest descent is nonetheless an inefficient strategy, taking many steps to 
reach the valley floor. (b) Magnified view of one step: A step starts off in the local gradient direction, 
perpendicular to the contour lines, and traverses a straight line until a local minimum is reached, where 
the traverse is parallel to the local contour lines. 

A common beginner’s error is to assume that any reasonable way of incorporating 
gradient information should be about as good as any other. This line of thought leads 
to the following not very good algorithm, the steepest descent method: 


Steepest Descent: Start at a point $\mathbf{P}_0$. As many times 
as needed, move from point $\mathbf{P}_i$ to the point $\mathbf{P}_{i+1}$ by 
minimizing along the line from $\mathbf{P}_i$ in the direction of 
the local downhill gradient $-\nabla f(\mathbf{P}_i)$. 


The problem with the steepest descent method (which, incidentally, goes back 
to Cauchy), is similar to the problem that was shown in Figure 10.5.1. The method 
will perform many small steps in going down a long, narrow valley, even if the valley 
is a perfect quadratic form. You might have hoped that, say in two dimensions, 
your first step would take you to the valley floor, the second step directly down 
the long axis; but remember that the new gradient at the minimum point of any 
line minimization is perpendicular to the direction just traversed. Therefore, with 
the steepest descent method, you must make a right angle turn, which does not, in 
general, take you to the minimum. (See Figure 10.6.1.) 

Just as in the discussion that led up to equation (10.5.5), we really want a way 
of proceeding not down the new gradient, but rather in a direction that is somehow 
constructed to be conjugate to the old gradient, and, insofar as possible, to all 
previous directions traversed. Methods that accomplish this construction are called 
conjugate gradient methods. 

In §2.7 we discussed the conjugate gradient method as a technique for solving 
linear algebraic equations by minimizing a quadratic form. That formalism can also 
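be applied to the problem of minimizing a function approximated by the quadratic 
form (10.6.1). Recall that, starting with an arbitrary initial vector $\mathbf{g}_0$ and letting 
$\mathbf{h}_0 = \mathbf{g}_0$, the conjugate gradient method constructs two sequences of vectors from 
the recurrence 

$$ \mathbf{g}_{i+1} = \mathbf{g}_i - \lambda_i \mathbf{A} \cdot \mathbf{h}_i \qquad \mathbf{h}_{i+1} = \mathbf{g}_{i+1} + \gamma_i \mathbf{h}_i \qquad i = 0, 1, 2, \ldots \tag{10.6.2} $$

The vectors satisfy the orthogonality and conjugacy conditions 

$$ \mathbf{g}_i \cdot \mathbf{g}_j = 0 \qquad \mathbf{h}_i \cdot \mathbf{A} \cdot \mathbf{h}_j = 0 \qquad \mathbf{g}_i \cdot \mathbf{h}_j = 0 \qquad j < i \tag{10.6.3} $$

The scalars $\lambda_i$ and $\gamma_i$ are given by 

$$ \lambda_i = \frac{\mathbf{g}_i \cdot \mathbf{g}_i}{\mathbf{h}_i \cdot \mathbf{A} \cdot \mathbf{h}_i} = \frac{\mathbf{g}_i \cdot \mathbf{h}_i}{\mathbf{h}_i \cdot \mathbf{A} \cdot \mathbf{h}_i} \tag{10.6.4} $$

$$ \gamma_i = \frac{\mathbf{g}_{i+1} \cdot \mathbf{g}_{i+1}}{\mathbf{g}_i \cdot \mathbf{g}_i} \tag{10.6.5} $$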


Equations (10.6.2)-(10.6.5) are simply equations (2.7.32)-(2.7.35) for a symmetric 
A in a new notation. (A self-contained derivation of these results in the context of 
function minimization is given by Polak [1].) 

Now suppose that we knew the Hessian matrix A in equation (10.6.1). Then 
we could use the construction (10.6.2) to find successively conjugate directions $\mathbf{h}_i$ 
along which to line-minimize. After N such, we would efficiently have arrived at 
the minimum of the quadratic form. But we don’t know A. 

Here is a remarkable theorem to save the day: Suppose we happen to have 
$\mathbf{g}_i = -\nabla f(\mathbf{P}_i)$, for some point $\mathbf{P}_i$, where f is of the form (10.6.1). Suppose that we 
proceed from $\mathbf{P}_i$ along the direction $\mathbf{h}_i$ to the local minimum of f located at some 
point $\mathbf{P}_{i+1}$ and then set $\mathbf{g}_{i+1} = -\nabla f(\mathbf{P}_{i+1})$. Then, this $\mathbf{g}_{i+1}$ is the same vector 
as would have been constructed by equation (10.6.2). (And we have constructed 
it without knowledge of A!) 

Proof: By equation (10.5.3), $\mathbf{g}_i = -\mathbf{A} \cdot \mathbf{P}_i + \mathbf{b}$, and 

$$ \mathbf{g}_{i+1} = -\mathbf{A} \cdot (\mathbf{P}_i + \lambda \mathbf{h}_i) + \mathbf{b} = \mathbf{g}_i - \lambda \mathbf{A} \cdot \mathbf{h}_i \tag{10.6.6} $$

with $\lambda$ chosen to take us to the line minimum. But at the line minimum $\mathbf{h}_i \cdot \nabla f = 
-\mathbf{h}_i \cdot \mathbf{g}_{i+1} = 0$. This latter condition is easily combined with (10.6.6) to solve for 
$\lambda$. The result is exactly the expression (10.6.4). But with this value of $\lambda$, (10.6.6) 
is the same as (10.6.2), q.e.d. 

We have, then, the basis of an algorithm that requires neither knowledge of the 
Hessian matrix A, nor even the storage necessary to store such a matrix. A sequence 
of directions $\mathbf{h}_i$ is constructed, using only line minimizations, evaluations of the 
gradient vector, and an auxiliary vector to store the latest in the sequence of g’s. 

The algorithm described so far is the original Fletcher-Reeves version of the 
conjugate gradient algorithm. Later, Polak and Ribiere introduced one tiny, but 
sometimes significant, change. They proposed using the form 


$$ \gamma_i = \frac{(\mathbf{g}_{i+1} - \mathbf{g}_i) \cdot \mathbf{g}_{i+1}}{\mathbf{g}_i \cdot \mathbf{g}_i} \tag{10.6.7} $$

instead of equation (10.6.5). “Wait,” you say, “aren’t they equal by the orthogonality 
conditions (10.6.3)?” They are equal for exact quadratic forms. In the real world, 
however, your function is not exactly a quadratic form. Arriving at the supposed 
minimum of the quadratic form, you may still need to proceed for another set of 
iterations. There is some evidence [2] that the Polak-Ribiere formula accomplishes 
the transition to further iterations more gracefully: When it runs out of steam, it 
tends to reset h to be down the local gradient, which is equivalent to beginning the 
conjugate-gradient procedure anew. 
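
To make the construction concrete, here is a minimal sketch of conjugate gradient iterations with the Polak-Ribiere $\gamma_i$ of (10.6.7), applied to a small quadratic form. Everything named here is an assumption for illustration: the test function inside `grad_f`, the helper `line_min_lambda`, and the driver `cg_polak_ribiere` are not routines of this book. The line minimization uses a single secant step on the directional derivative, which is exact only because the example function is quadratic; note that the Hessian is never formed explicitly by the iteration itself.

```c
#define NDIM 2

/* Gradient of the assumed quadratic form f = c - b.x + (1/2) x.A.x,
   with A = [[2,1],[1,3]] and b = (1,2). */
static void grad_f(const double x[NDIM], double df[NDIM])
{
    df[0] = 2.0 * x[0] + 1.0 * x[1] - 1.0;
    df[1] = 1.0 * x[0] + 3.0 * x[1] - 2.0;
}

/* Line minimum along h using gradients only. For a quadratic form the
   directional derivative h.grad f(p + lambda h) is linear in lambda, so
   one secant step through lambda = 0 and lambda = 1 finds its zero. */
static double line_min_lambda(const double p[NDIM], const double h[NDIM])
{
    double q[NDIM], df[NDIM], d0 = 0.0, d1 = 0.0;
    int j;

    grad_f(p, df);
    for (j = 0; j < NDIM; j++) { d0 += h[j] * df[j]; q[j] = p[j] + h[j]; }
    grad_f(q, df);
    for (j = 0; j < NDIM; j++) d1 += h[j] * df[j];
    return (d0 == d1) ? 0.0 : d0 / (d0 - d1);
}

/* Up to itmax conjugate gradient line minimizations starting from p.
   Returns the number of line minimizations performed. */
int cg_polak_ribiere(double p[NDIM], int itmax)
{
    double g[NDIM], h[NDIM], df[NDIM], gg, dgg, gam, lam;
    int j, its;

    grad_f(p, df);
    for (j = 0; j < NDIM; j++) g[j] = h[j] = -df[j];    /* g_0 = h_0 = -grad f */
    for (its = 1; its <= itmax; its++) {
        lam = line_min_lambda(p, h);
        for (j = 0; j < NDIM; j++) p[j] += lam * h[j];
        grad_f(p, df);                                  /* df = -g_{i+1} */
        gg = dgg = 0.0;
        for (j = 0; j < NDIM; j++) {
            gg += g[j] * g[j];                          /* g_i.g_i */
            dgg += (df[j] + g[j]) * df[j];              /* (g_{i+1} - g_i).g_{i+1} */
        }
        if (gg == 0.0) return its;                      /* previous gradient was zero */
        gam = dgg / gg;                                 /* (10.6.7) */
        for (j = 0; j < NDIM; j++) {
            g[j] = -df[j];
            h[j] = g[j] + gam * h[j];                   /* (10.6.2) */
        }
    }
    return itmax;
}
```

Starting from the origin, N = 2 line minimizations should bring p to the solution of A · x = b, which is (0.2, 0.6) for these assumed values. Replacing the `dgg` line by `dgg += df[j] * df[j];` gives the Fletcher-Reeves form (10.6.5).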

The following routine implements the Polak-Ribiere variant, which we recommend; 
but changing one program line, as shown, will give you Fletcher-Reeves. The 
routine presumes the existence of a function func(p), where p[1..n] is a vector 
of length n, and also presumes the existence of a function dfunc(p,df) that sets 
the vector gradient df[1..n] evaluated at the input point p. 

The routine calls linmin to do the line minimizations. As already discussed, 
you may wish to use a modified version of linmin that uses dbrent instead of 
brent, i.e., that uses the gradient in doing the line minimizations. See note below. 

#include <math.h> 
#include "nrutil.h" 
#define ITMAX 200 
#define EPS 1.0e-10 
/* Here ITMAX is the maximum allowed number of iterations, while EPS is a small number to 
rectify the special case of converging to exactly zero function value. */ 
#define FREEALL free_vector(xi,1,n);free_vector(h,1,n);free_vector(g,1,n); 

void frprmn(float p[], int n, float ftol, int *iter, float *fret, 
    float (*func)(float []), void (*dfunc)(float [], float [])) 
/* Given a starting point p[1..n], Fletcher-Reeves-Polak-Ribiere minimization is performed on a 
function func, using its gradient as calculated by a routine dfunc. The convergence tolerance 
on the function value is input as ftol. Returned quantities are p (the location of the minimum), 
iter (the number of iterations that were performed), and fret (the minimum value of the 
function). The routine linmin is called to perform line minimizations. */ 
{ 
    void linmin(float p[], float xi[], int n, float *fret, 
        float (*func)(float [])); 
    int j,its; 
    float gg,gam,fp,dgg; 
    float *g,*h,*xi; 

    g=vector(1,n); 
    h=vector(1,n); 
    xi=vector(1,n); 
    fp=(*func)(p);                          /* Initializations. */ 
    (*dfunc)(p,xi); 
    for (j=1;j<=n;j++) { 
        g[j] = -xi[j]; 
        xi[j]=h[j]=g[j]; 
    } 
    for (its=1;its<=ITMAX;its++) {          /* Loop over iterations. */ 
        *iter=its; 
        linmin(p,xi,n,fret,func); 
        /* Next statement is the normal return: */ 
        if (2.0*fabs(*fret-fp) <= ftol*(fabs(*fret)+fabs(fp)+EPS)) { 
            FREEALL 
            return; 
        } 
        fp= *fret; 
        (*dfunc)(p,xi); 
        dgg=gg=0.0; 
        for (j=1;j<=n;j++) { 
            gg += g[j]*g[j]; 
            /* dgg += xi[j]*xi[j]; */       /* This statement for Fletcher-Reeves. */ 
            dgg += (xi[j]+g[j])*xi[j];      /* This statement for Polak-Ribiere. */ 
        } 
        if (gg == 0.0) {                    /* Unlikely. If gradient is exactly zero then */ 
            FREEALL                         /* we are already done. */ 
            return; 
        } 
        gam=dgg/gg; 
        for (j=1;j<=n;j++) { 
            g[j] = -xi[j]; 
            xi[j]=h[j]=g[j]+gam*h[j]; 
        } 
    } 
    nrerror("Too many iterations in frprmn"); 
} 


Note on Line Minimization Using Derivatives 

Kindly reread the last part of §10.5. We here want to do the same thing, but 
using derivative information in performing the line minimization. 

The modified version of linmin, called dlinmin, and its required companion 
routine df1dim follow: 

#include "nrutil.h" 
#define TOL 2.0e-4                  /* Tolerance passed to dbrent. */ 

int ncom;                           /* Global variables communicate with df1dim. */ 
float *pcom,*xicom,(*nrfunc)(float []); 
void (*nrdfun)(float [], float []); 

void dlinmin(float p[], float xi[], int n, float *fret, float (*func)(float []), 
    void (*dfunc)(float [], float [])) 
/* Given an n-dimensional point p[1..n] and an n-dimensional direction xi[1..n], moves and 
resets p to where the function func(p) takes on a minimum along the direction xi from p, 
and replaces xi by the actual vector displacement that p was moved. Also returns as fret 
the value of func at the returned location p. This is actually all accomplished by calling the 
routines mnbrak and dbrent. */ 
{ 
    float dbrent(float ax, float bx, float cx, 
        float (*f)(float), float (*df)(float), float tol, float *xmin); 
    float f1dim(float x); 
    float df1dim(float x); 
    void mnbrak(float *ax, float *bx, float *cx, float *fa, float *fb, 
        float *fc, float (*func)(float)); 
    int j; 
    float xx,xmin,fx,fb,fa,bx,ax; 

    ncom=n;                         /* Define the global variables. */ 
    pcom=vector(1,n); 
    xicom=vector(1,n); 
    nrfunc=func; 
    nrdfun=dfunc; 
    for (j=1;j<=n;j++) { 
        pcom[j]=p[j]; 
        xicom[j]=xi[j]; 
    } 
    ax=0.0;                         /* Initial guess for brackets. */ 
    xx=1.0; 
    mnbrak(&ax,&xx,&bx,&fa,&fx,&fb,f1dim); 
    *fret=dbrent(ax,xx,bx,f1dim,df1dim,TOL,&xmin); 
    for (j=1;j<=n;j++) {            /* Construct the vector results to return. */ 
        xi[j] *= xmin; 
        p[j] += xi[j]; 
    } 
    free_vector(xicom,1,n); 
    free_vector(pcom,1,n); 
} 


#include "nrutil.h" 

extern int ncom;                    /* Defined in dlinmin. */ 
extern float *pcom,*xicom,(*nrfunc)(float []); 
extern void (*nrdfun)(float [], float []); 

float df1dim(float x) 
{ 
    int j; 
    float df1=0.0; 
    float *xt,*df; 

    xt=vector(1,ncom); 
    df=vector(1,ncom); 
    for (j=1;j<=ncom;j++) xt[j]=pcom[j]+x*xicom[j]; 
    (*nrdfun)(xt,df); 
    for (j=1;j<=ncom;j++) df1 += df[j]*xicom[j]; 
    free_vector(df,1,ncom); 
    free_vector(xt,1,ncom); 
    return df1; 
} 


CITED REFERENCES AND FURTHER READING: 

Polak, E. 1971, Computational Methods in Optimization (New York: Academic Press), §2.3. [1] 
Jacobs, D.A.H. (ed.) 1977, The State of the Art in Numerical Analysis (London: Academic Press), 
Chapter III.1.7 (by K.W. Brodlie). [2] 

Stoer, J., and Bulirsch, R. 1980, Introduction to Numerical Analysis (New York: Springer-Verlag), 


10.7 Variable Metric Methods in Multidimensions 

The goal of variable metric methods, which are sometimes called quasi-Newton 
methods, is not different from the goal of conjugate gradient methods: to accumulate 
information from successive line minimizations so that N such line minimizations 
lead to the exact minimum of a quadratic form in N dimensions. In that case, the 
method will also be quadratically convergent for more general smooth functions. 

Both variable metric and conjugate gradient methods require that you are able to 
compute your function’s gradient, or first partial derivatives, at arbitrary points. The 
variable metric approach differs from the conjugate gradient in the way that it stores 
and updates the information that is accumulated. Instead of requiring intermediate 
storage on the order of N, the number of dimensions, it requires a matrix of size 
N × N. Generally, for any moderate N, this is an entirely trivial disadvantage. 

On the other hand, there is not, as far as we know, any overwhelming advantage 
that the variable metric methods hold over the conjugate gradient techniques, except 
perhaps a historical one. Developed somewhat earlier, and more widely propagated, 
the variable metric methods have by now developed a wider constituency of satisfied 
users. Likewise, some fancier implementations of variable metric methods (going 
beyond the scope of this book, see below) have been developed to a greater level of 
sophistication on issues like the minimization of roundoff error, handling of special 
conditions, and so on. We tend to use variable metric rather than conjugate gradient, 
but we have no reason to urge this habit on you. 

Variable metric methods come in two main flavors. One is the Davidon-Fletcher- 
Powell (DFP) algorithm (sometimes referred to as simply Fletcher-Powell). The 
other goes by the name Broyden-Fletcher-Goldfarb-Shanno (BFGS). The BFGS and 
DFP schemes differ only in details of their roundoff error, convergence tolerances, 
and similar “dirty” issues which are outside of our scope [1,2]. However, it has 
become generally recognized that, empirically, the BFGS scheme is superior in these 
details. We will implement BFGS in this section. 

As before, we imagine that our arbitrary function f(x) can be locally approximated 
by the quadratic form of equation (10.6.1). We don’t, however, have any 
information about the values of the quadratic form’s parameters A and b, except 
insofar as we can glean such information from our function evaluations and line 
minimizations. 

The basic idea of the variable metric method is to build up, iteratively, a good 
approximation to the inverse Hessian matrix $\mathbf{A}^{-1}$, that is, to construct a sequence 
of matrices $\mathbf{H}_i$ with the property, 

$$ \lim_{i \to \infty} \mathbf{H}_i = \mathbf{A}^{-1} \tag{10.7.1} $$

Even better if the limit is achieved after N iterations instead of ∞. 

The reason that variable metric methods are sometimes called quasi-Newton 
methods can now be explained. Consider finding a minimum by using Newton’s 
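method to search for a zero of the gradient of the function. Near the current point 
$\mathbf{x}_i$, we have to second order 

$$ f(\mathbf{x}) = f(\mathbf{x}_i) + (\mathbf{x} - \mathbf{x}_i) \cdot \nabla f(\mathbf{x}_i) + \frac{1}{2} (\mathbf{x} - \mathbf{x}_i) \cdot \mathbf{A} \cdot (\mathbf{x} - \mathbf{x}_i) \tag{10.7.2} $$

so 

$$ \nabla f(\mathbf{x}) = \nabla f(\mathbf{x}_i) + \mathbf{A} \cdot (\mathbf{x} - \mathbf{x}_i) \tag{10.7.3} $$

In Newton’s method we set $\nabla f(\mathbf{x}) = 0$ to determine the next iteration point: 

$$ \mathbf{x} - \mathbf{x}_i = -\mathbf{A}^{-1} \cdot \nabla f(\mathbf{x}_i) \tag{10.7.4} $$

The left-hand side is the finite step we need take to get to the exact minimum; the 
right-hand side is known once we have accumulated an accurate $\mathbf{H} \approx \mathbf{A}^{-1}$. 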

The “quasi” in quasi-Newton is because we don’t use the actual Hessian matrix 
of f, but instead use our current approximation of it. This is often better than 
using the true Hessian. We can understand this paradoxical result by considering the 
descent directions of f at $\mathbf{x}_i$. These are the directions p along which f decreases: 
$\nabla f \cdot \mathbf{p} < 0$. For the Newton direction (10.7.4) to be a descent direction, we must have 

$$ \nabla f(\mathbf{x}_i) \cdot (\mathbf{x} - \mathbf{x}_i) = -(\mathbf{x} - \mathbf{x}_i) \cdot \mathbf{A} \cdot (\mathbf{x} - \mathbf{x}_i) < 0 \tag{10.7.5} $$

which is true if A is positive definite. In general, far from a minimum, we have no 
guarantee that the Hessian is positive definite. Taking the actual Newton step with 
the real Hessian can move us to points where the function is increasing in value. 
The idea behind quasi-Newton methods is to start with a positive definite, symmetric 
approximation to A (usually the unit matrix) and build up the approximating $\mathbf{H}_i$’s 
in such a way that the matrix $\mathbf{H}_i$ remains positive definite and symmetric. Far from 
the minimum, this guarantees that we always move in a downhill direction. Close 
to the minimum, the updating formula approaches the true Hessian and we enjoy 
the quadratic convergence of Newton’s method. 
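
The role of positive definiteness in (10.7.5) can be checked numerically in two dimensions. The routine below is a sketch under assumptions: the name `newton_step_2d` and its interface (the three independent elements of a symmetric 2 × 2 matrix passed as scalars) are hypothetical and chosen only to keep the example short.

```c
/* Newton step p = -A^{-1}.grad (10.7.4) for the symmetric 2x2 matrix
   A = [[a11, a12], [a12, a22]], followed by the descent test of (10.7.5).
   Returns 1 if grad.p < 0 (descent direction), 0 if not,
   and -1 if A is singular (p is then left untouched). */
int newton_step_2d(double a11, double a12, double a22,
                   const double grad[2], double p[2])
{
    double det = a11 * a22 - a12 * a12;

    if (det == 0.0) return -1;
    p[0] = -( a22 * grad[0] - a12 * grad[1]) / det;
    p[1] = -(-a12 * grad[0] + a11 * grad[1]) / det;
    return (grad[0] * p[0] + grad[1] * p[1] < 0.0) ? 1 : 0;
}
```

With the positive definite matrix [[2, 1], [1, 3]] and gradient (1, 2), the step is (−0.2, −0.6) and the test reports a descent direction. With the indefinite matrix [[1, 0], [0, −1]] and gradient (0, 1), the Newton step is (0, 1), which points uphill.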

When we are not close enough to the minimum, taking the full Newton step 
p even with a positive definite A need not decrease the function; we may move 
too far for the quadratic approximation to be valid. All we are guaranteed is that 
initially f decreases as we move in the Newton direction. Once again we can use 
the backtracking strategy described in §9.7 to choose a step along the direction of 
the Newton step p, but not necessarily all the way. 

We won’t rigorously derive the DFP algorithm for taking $\mathbf{H}_i$ into $\mathbf{H}_{i+1}$; you 
can consult [3] for clear derivations. Following Brodlie (in [2]), we will give the 
following heuristic motivation of the procedure. 

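Subtracting equation (10.7.4) at $\mathbf{x}_{i+1}$ from that same equation at $\mathbf{x}_i$ gives 

$$ \mathbf{x}_{i+1} - \mathbf{x}_i = \mathbf{A}^{-1} \cdot (\nabla f_{i+1} - \nabla f_i) \tag{10.7.6} $$

where $\nabla f_j \equiv \nabla f(\mathbf{x}_j)$. Having made the step from $\mathbf{x}_i$ to $\mathbf{x}_{i+1}$, we might reasonably 
want to require that the new approximation $\mathbf{H}_{i+1}$ satisfy (10.7.6) as if it were 
actually $\mathbf{A}^{-1}$, that is, 

$$ \mathbf{x}_{i+1} - \mathbf{x}_i = \mathbf{H}_{i+1} \cdot (\nabla f_{i+1} - \nabla f_i) \tag{10.7.7} $$

We might also imagine that the updating formula should be of the form $\mathbf{H}_{i+1} = 
\mathbf{H}_i +$ correction. 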

What “objects” are around out of which to construct a correction term? Most 
notable are the two vectors Xj+i — x* and V/j+i - V/,; and there is also Hj. 
There are not infinitely many natural ways of making a matrix out of these objects, 
especially if (10.7.7) must hold! One such way, the DFP updating formula, is 

H -HI ( Xi+1 ~ ® ( Xi+1 “ X ^ 

i+1 i+ (xj +1 -Xj).(V/j +1 -V/j) 

[Hj • (V/j+i - V/j)] 0 [Hj • (V/ i+1 - V/j)] 1 

(V/j+i - V/j) • Hj • (V/j+i - V/j) 



where 0 denotes the “outer” or “direct” product of two vectors, a matrix: The ij 
component of u0 v is Uj Vj . (You might want to verify that 10.7.8 does satisfy 10.7.7.) 
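The BFGS updating formula is exactly the same, but with one additional term, 

$$\cdots + [(\nabla f_{i+1} - \nabla f_i) \cdot \mathbf{H}_i \cdot (\nabla f_{i+1} - \nabla f_i)] \, \mathbf{u} \otimes \mathbf{u} \tag{10.7.9}$$

where $\mathbf{u}$ is defined as the vector 

$$\mathbf{u} \equiv \frac{(\mathbf{x}_{i+1} - \mathbf{x}_i)}{(\mathbf{x}_{i+1} - \mathbf{x}_i) \cdot (\nabla f_{i+1} - \nabla f_i)} - \frac{\mathbf{H}_i \cdot (\nabla f_{i+1} - \nabla f_i)}{(\nabla f_{i+1} - \nabla f_i) \cdot \mathbf{H}_i \cdot (\nabla f_{i+1} - \nabla f_i)} \tag{10.7.10}$$

(You might also verify that this satisfies 10.7.7.) 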
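
If you would rather let the machine do the verifying, the following sketch applies update (10.7.8), optionally with the extra term (10.7.9), and measures how well the result satisfies (10.7.7). The fixed dimension NDIM and the names qn_update and secant_residual are assumptions made for this illustration; they are not part of dfpmin.

```c
#include <assert.h>

#define NDIM 2   /* hypothetical fixed dimension for this sketch */

/* Update H in place by (10.7.8); if bfgs is nonzero also add (10.7.9).
   dx = x_{i+1} - x_i and dg = grad f_{i+1} - grad f_i. */
void qn_update(double H[NDIM][NDIM], const double dx[NDIM],
               const double dg[NDIM], int bfgs)
{
    double hdg[NDIM], u[NDIM];
    double fac = 0.0, fae = 0.0;
    int i, j;

    for (i = 0; i < NDIM; i++) {          /* hdg = H . dg */
        hdg[i] = 0.0;
        for (j = 0; j < NDIM; j++) hdg[i] += H[i][j] * dg[j];
    }
    for (i = 0; i < NDIM; i++) {          /* the two denominators */
        fac += dx[i] * dg[i];
        fae += dg[i] * hdg[i];
    }
    assert(fac != 0.0 && fae != 0.0);
    for (i = 0; i < NDIM; i++)            /* the vector u of (10.7.10) */
        u[i] = dx[i] / fac - hdg[i] / fae;
    for (i = 0; i < NDIM; i++)
        for (j = 0; j < NDIM; j++) {
            H[i][j] += dx[i] * dx[j] / fac - hdg[i] * hdg[j] / fae;
            if (bfgs) H[i][j] += fae * u[i] * u[j];
        }
}

/* Largest component, in absolute value, of H . dg - dx.
   This is zero when (10.7.7) holds exactly. */
double secant_residual(double H[NDIM][NDIM], const double dx[NDIM],
                       const double dg[NDIM])
{
    double worst = 0.0;
    int i, j;

    for (i = 0; i < NDIM; i++) {
        double r = -dx[i];
        for (j = 0; j < NDIM; j++) r += H[i][j] * dg[j];
        if (r < 0.0) r = -r;
        if (r > worst) worst = r;
    }
    return worst;
}
```

Starting from the unit matrix with any dx and dg whose dot product is nonzero, the residual after either update should be zero to roundoff. The BFGS term leaves (10.7.7) intact because u is orthogonal to the gradient difference.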

You will have to take on faith (or else consult [3] for details of) the “deep” 
result that equation (10.7.8), with or without (10.7.9), does in fact converge to $\mathbf{A}^{-1}$ 
in N steps, if f is a quadratic form. 

Here now is the routine dfpmin that implements the quasi-Newton method, and 
uses lnsrch from §9.7. As mentioned at the end of newt in §9.7, this algorithm 
can fail if your variables are badly scaled. 

#include <math.h>
#include "nrutil.h"
#define ITMAX 200        /* Maximum allowed number of iterations. */
#define EPS 3.0e-8       /* Machine precision. */
#define TOLX (4*EPS)     /* Convergence criterion on x values. */
#define STPMX 100.0      /* Scaled maximum step length allowed in line searches. */

#define FREEALL free_vector(xi,1,n);free_vector(pnew,1,n); \
free_matrix(hessin,1,n,1,n);free_vector(hdg,1,n);free_vector(g,1,n); \
free_vector(dg,1,n);

void dfpmin(float p[], int n, float gtol, int *iter, float *fret,
    float (*func)(float []), void (*dfunc)(float [], float []))
/* Given a starting point p[1..n] that is a vector of length n, the Broyden-Fletcher-Goldfarb-
Shanno variant of Davidon-Fletcher-Powell minimization is performed on a function func, using
its gradient as calculated by a routine dfunc. The convergence requirement on zeroing the
gradient is input as gtol. Returned quantities are p[1..n] (the location of the minimum),
iter (the number of iterations that were performed), and fret (the minimum value of the
function). The routine lnsrch is called to perform approximate line minimizations. */
{
    void lnsrch(int n, float xold[], float fold, float g[], float p[], float x[],
        float *f, float stpmax, int *check, float (*func)(float []));
    int check,i,its,j;
    float den,fac,fad,fae,fp,stpmax,sum=0.0,sumdg,sumxi,temp,test;
    float *dg,*g,*hdg,**hessin,*pnew,*xi;

    dg=vector(1,n);
    g=vector(1,n);
    hdg=vector(1,n);
    hessin=matrix(1,n,1,n);
    pnew=vector(1,n);
    xi=vector(1,n);
    fp=(*func)(p);                        /* Calculate starting function value and gradient, */
    (*dfunc)(p,g);
    for (i=1;i<=n;i++) {                  /* and initialize the inverse Hessian to the unit matrix. */
        for (j=1;j<=n;j++) hessin[i][j]=0.0;
        hessin[i][i]=1.0;
        xi[i] = -g[i];                    /* Initial line direction. */
        sum += p[i]*p[i];
    }
    stpmax=STPMX*FMAX(sqrt(sum),(float)n);
    for (its=1;its<=ITMAX;its++) {        /* Main loop over the iterations. */
        *iter=its;
        lnsrch(n,p,fp,g,xi,pnew,fret,stpmax,&check,func);
        /* The new function evaluation occurs in lnsrch; save the function value in fp for the
        next line search. It is usually safe to ignore the value of check. */
        fp = *fret;
        for (i=1;i<=n;i++) {
            xi[i]=pnew[i]-p[i];           /* Update the line direction, */
            p[i]=pnew[i];                 /* and the current point. */
        }
        test=0.0;                         /* Test for convergence on delta x. */
        for (i=1;i<=n;i++) {
            temp=fabs(xi[i])/FMAX(fabs(p[i]),1.0);
            if (temp > test) test=temp;
        }
        if (test < TOLX) {
            FREEALL
            return;
        }
        for (i=1;i<=n;i++) dg[i]=g[i];    /* Save the old gradient, */
        (*dfunc)(p,g);                    /* and get the new gradient. */
        test=0.0;                         /* Test for convergence on zero gradient. */
        den=FMAX(*fret,1.0);
        for (i=1;i<=n;i++) {
            temp=fabs(g[i])*FMAX(fabs(p[i]),1.0)/den;
            if (temp > test) test=temp;
        }
        if (test < gtol) {
            FREEALL
            return;
        }
        for (i=1;i<=n;i++) dg[i]=g[i]-dg[i];   /* Compute difference of gradients, */
        for (i=1;i<=n;i++) {                   /* and difference times current matrix. */
            hdg[i]=0.0;
            for (j=1;j<=n;j++) hdg[i] += hessin[i][j]*dg[j];
        }
        fac=fae=sumdg=sumxi=0.0;          /* Calculate dot products for the denominators. */
        for (i=1;i<=n;i++) {
            fac += dg[i]*xi[i];
            fae += dg[i]*hdg[i];
            sumdg += SQR(dg[i]);
            sumxi += SQR(xi[i]);
        }
        if (fac > sqrt(EPS*sumdg*sumxi)) {     /* Skip update if fac not sufficiently positive. */
            fac=1.0/fac;
            fad=1.0/fae;
            /* The vector that makes BFGS different from DFP: */
            for (i=1;i<=n;i++) dg[i]=fac*xi[i]-fad*hdg[i];
            for (i=1;i<=n;i++) {          /* The BFGS updating formula: */
                for (j=i;j<=n;j++) {
                    hessin[i][j] += fac*xi[i]*xi[j]
                        -fad*hdg[i]*hdg[j]+fae*dg[i]*dg[j];
                    hessin[j][i]=hessin[i][j];
                }
            }
        }
        for (i=1;i<=n;i++) {              /* Now calculate the next direction to go, */
            xi[i]=0.0;
            for (j=1;j<=n;j++) xi[i] -= hessin[i][j]*g[j];
        }
    }                                     /* and go back for another iteration. */
    nrerror("too many iterations in dfpmin");
    FREEALL
}



Quasi-Newton methods like dfpmin work well with the approximate line 
minimization done by lnsrch. The routines powell (§10.5) and frprmn (§10.6), 
however, need more accurate line minimization, which is carried out by the routine 
linmin. 

Advanced Implementations of Variable Metric Methods 

Although rare, it can conceivably happen that roundoff errors cause the matrix $\mathbf{H}_i$ to 
become nearly singular or non-positive-definite. This can be serious, because the supposed 
search directions might then not lead downhill, and because nearly singular $\mathbf{H}_i$’s tend to give 
subsequent $\mathbf{H}_i$’s that are also nearly singular. 

There is a simple fix for this rare problem, the same as was mentioned in §10.4: In case 
of any doubt, you should restart the algorithm at the claimed minimum point, and see if it 
goes anywhere. Simple, but not very elegant. Modern implementations of variable metric 
methods deal with the problem in a more sophisticated way. 

Instead of building up an approximation to $\mathbf{A}^{-1}$, it is possible to build up an approximation 
of $\mathbf{A}$ itself. Then, instead of calculating the left-hand side of (10.7.4) directly, one solves 
the set of linear equations 

$$\mathbf{A} \cdot (\mathbf{x} - \mathbf{x}_i) = -\nabla f(\mathbf{x}_i) \tag{10.7.11}$$

At first glance this seems like a bad idea, since solving (10.7.11) is a process of order 
$N^3$; and anyway, how does this help the roundoff problem? The trick is not to store $\mathbf{A}$ but 
rather a triangular decomposition of $\mathbf{A}$, its Cholesky decomposition (cf. §2.9). The updating 
formula used for the Cholesky decomposition of $\mathbf{A}$ is of order $N^2$ and can be arranged to 
guarantee that the matrix remains positive definite and nonsingular, even in the presence of 
finite roundoff. This method is due to Gill and Murray [1,2]. 
10.8 Linear Programming and the Simplex Method 


The subject of linear programming, sometimes called linear optimization, 
concerns itself with the following problem: For N independent variables $x_1, \ldots, x_N$, 
maximize the function 

$$z = a_{01}x_1 + a_{02}x_2 + \cdots + a_{0N}x_N \tag{10.8.1}$$

subject to the primary constraints 

$$x_1 \ge 0, \quad x_2 \ge 0, \quad \ldots \quad x_N \ge 0 \tag{10.8.2}$$

and simultaneously subject to $M = m_1 + m_2 + m_3$ additional constraints, $m_1$ of 
them of the form 

$$a_{i1}x_1 + a_{i2}x_2 + \cdots + a_{iN}x_N \le b_i \quad (b_i \ge 0) \qquad i = 1, \ldots, m_1 \tag{10.8.3}$$

$m_2$ of them of the form 

$$a_{j1}x_1 + a_{j2}x_2 + \cdots + a_{jN}x_N \ge b_j \ge 0 \qquad j = m_1 + 1, \ldots, m_1 + m_2 \tag{10.8.4}$$

and $m_3$ of them of the form 

$$a_{k1}x_1 + a_{k2}x_2 + \cdots + a_{kN}x_N = b_k \ge 0 \qquad k = m_1 + m_2 + 1, \ldots, m_1 + m_2 + m_3 \tag{10.8.5}$$

The various $a_{ij}$’s can have either sign, or be zero. The fact that the b’s must all be 
nonnegative (as indicated by the final inequality in the above three equations) is a 
matter of convention only, since you can multiply any contrary inequality by −1. 
There is no particular significance in the number of constraints M being less than, 
equal to, or greater than the number of unknowns N. 

A set of values $x_1 \ldots x_N$ that satisfies the constraints (10.8.2)–(10.8.5) is called 
a feasible vector. The function that we are trying to maximize is called the objective 
function. The feasible vector that maximizes the objective function is called the 
optimal feasible vector. An optimal feasible vector can fail to exist for two distinct 
reasons: (i) there are no feasible vectors, i.e., the given constraints are incompatible, 
or (ii) there is no maximum, i.e., there is a direction in N space where one or more 
of the variables can be taken to infinity while still satisfying the constraints, giving 
an unbounded value for the objective function. 

As you see, the subject of linear programming is surrounded by notational and 
terminological thickets. Both of these thorny defenses are lovingly cultivated by a 
coterie of stern acolytes who have devoted themselves to the field. Actually, the 
basic ideas of linear programming are quite simple. Avoiding the shrubbery, we 
want to teach you the basics by means of a couple of specific examples; it should 
then be quite obvious how to generalize. 

Why is linear programming so important? (i) Because “nonnegativity” is the 
usual constraint on any variable $x_i$ that represents the tangible amount of some 
physical commodity, like guns, butter, dollars, units of vitamin E, food calories, 
kilowatt hours, mass, etc. Hence equation (10.8.2). (ii) Because one is often 
interested in additive (linear) limitations or bounds imposed by man or nature: 
minimum nutritional requirement, maximum affordable cost, maximum on available 
labor or capital, minimum tolerable level of voter approval, etc. Hence equations 
(10.8.3)-(10.8.5). (iii) Because the function that one wants to optimize may be 
linear, or else may at least be approximated by a linear function — since that is the 
problem that linear programming can solve. Hence equation (10.8.1). For a short, 
semipopular survey of linear programming applications, see Bland [1]. 



Here is a specific example of a problem in linear programming, which has 
$N = 4$, $m_1 = 2$, $m_2 = m_3 = 1$, hence $M = 4$: 

$$\text{Maximize} \quad z = x_1 + x_2 + 3x_3 - \tfrac{1}{2}x_4 \tag{10.8.6}$$

with all the x’s nonnegative and also with 

$$\begin{aligned} x_1 + 2x_3 &\le 740 \\ 2x_2 - 7x_4 &\le 0 \\ x_2 - x_3 + 2x_4 &\ge \tfrac{1}{2} \\ x_1 + x_2 + x_3 + x_4 &= 9 \end{aligned} \tag{10.8.7}$$

The answer turns out to be (to 2 decimals) $x_1 = 0$, $x_2 = 3.33$, $x_3 = 4.73$, $x_4 = 0.95$. 
In the rest of this section we will learn how this answer is obtained. Figure 10.8.1 
summarizes some of the terminology thus far. 

Figure 10.8.1. Basic concepts of linear programming. The case of only two independent variables, 
$x_1$, $x_2$, is shown. The linear function z, to be maximized, is represented by its contour lines. Primary 
constraints require $x_1$ and $x_2$ to be positive. Additional constraints may restrict the solution to regions 
(inequality constraints) or to surfaces of lower dimensionality (equality constraints). Feasible vectors 
satisfy all constraints. Feasible basic vectors also lie on the boundary of the allowed region. The simplex 
method steps among feasible basic vectors until the optimal feasible vector is found. 

Fundamental Theorem of Linear Optimization 



Imagine that we start with a full N-dimensional space of candidate vectors. Then 
(in mind’s eye, at least) we carve away the regions that are eliminated in turn by each 
imposed constraint. Since the constraints are linear, every boundary introduced by 
this process is a plane, or rather hyperplane. Equality constraints of the form (10.8.5) 
force the feasible region onto hyperplanes of smaller dimension, while inequalities 
simply divide the then-feasible region into allowed and nonallowed pieces. 

When all the constraints are imposed, either we are left with some feasible 
region or else there are no feasible vectors. Since the feasible region is bounded by 
hyperplanes, it is geometrically a kind of convex polyhedron or simplex (cf. §10.4). 
If there is a feasible region, can the optimal feasible vector be somewhere in its 
interior, away from the boundaries? No, because the objective function is linear. 
This means that it always has a nonzero vector gradient. This, in turn, means that 
we could always increase the objective function by running up the gradient until 
we hit a boundary wall. 

The boundary of any geometrical region has one less dimension than its interior. 
Therefore, we can now run up the gradient projected into the boundary wall until we 
reach an edge of that wall. We can then run up that edge, and so on, down through 
whatever number of dimensions, until we finally arrive at a point, a vertex of the 
original simplex. Since this point has all N of its coordinates defined, it must be 
the solution of N simultaneous equalities drawn from the original set of equalities 
and inequalities (10.8.2)—(10.8.5). 

Points that are feasible vectors and that satisfy N of the original constraints 
as equalities, are termed feasible basic vectors. If N > M, then a feasible basic 
vector has at least N − M of its components equal to zero, since at least that many 
of the constraints (10.8.2) will be needed to make up the total of N. Put the other 
way, at most M components of a feasible basic vector are nonzero. In the example 
(10.8.6)–(10.8.7), you can check that the solution as given satisfies as equalities the 
last three constraints of (10.8.7) and the constraint $x_1 \ge 0$, for the required total of 4. 
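
The check just suggested is easy to mechanize. The sketch below hard-codes the example problem (10.8.6)–(10.8.7) and counts how many of its constraints a given vector satisfies as equalities. The function names and the tolerance argument are our own assumptions. To try it, use the unrounded values x = (0, 3.325, 4.725, 0.95), which we obtained by solving the four active constraints exactly; they round to the figures quoted in the text.

```c
#include <assert.h>

#define NVAR 4   /* N for the example (10.8.6)-(10.8.7) */

static double absd(double v) { return v < 0.0 ? -v : v; }

/* Objective function (10.8.6). */
double objective(const double x[NVAR])
{
    return x[0] + x[1] + 3.0 * x[2] - 0.5 * x[3];
}

/* Returns the number of constraints in (10.8.2) and (10.8.7) that x
   satisfies as equalities (within tol), or -1 if x is not feasible. */
int count_active(const double x[NVAR], double tol)
{
    double slack[3], eq;
    int i, active = 0;

    slack[0] = 740.0 - (x[0] + 2.0 * x[2]);          /* x1 + 2 x3 <= 740      */
    slack[1] = -(2.0 * x[1] - 7.0 * x[3]);           /* 2 x2 - 7 x4 <= 0      */
    slack[2] = (x[1] - x[2] + 2.0 * x[3]) - 0.5;     /* x2 - x3 + 2 x4 >= 1/2 */
    eq = (x[0] + x[1] + x[2] + x[3]) - 9.0;          /* x1 + x2 + x3 + x4 = 9 */

    if (absd(eq) > tol) return -1;
    active++;                                        /* the equality itself   */
    for (i = 0; i < 3; i++) {
        if (slack[i] < -tol) return -1;              /* inequality violated   */
        if (slack[i] <= tol) active++;
    }
    for (i = 0; i < NVAR; i++) {
        if (x[i] < -tol) return -1;                  /* nonnegativity violated */
        if (x[i] <= tol) active++;
    }
    return active;
}
```

For the vector above the count comes out as N = 4, as a feasible basic vector requires, and the objective evaluates to 17.025.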

Put together the two preceding paragraphs and you have the Fundamental 
Theorem of Linear Optimization: If an optimal feasible vector exists, then there is a 
feasible basic vector that is optimal. (Didn’t we warn you about the terminological 
thicket?) 

The importance of the fundamental theorem is that it reduces the optimization 
problem to a “combinatorial” problem, that of determining which N constraints 
(out of the M + N constraints in 10.8.2-10.8.5) should be satisfied by the optimal 
feasible vector. We have only to keep trying different combinations, and computing 
the objective function for each trial, until we find the best. 

Doing this blindly would take halfway to forever. The simplex method , first 
published by Dantzig in 1948 (see [2]), is a way of organizing the procedure so that 
(i) a series of combinations is tried for which the objective function increases at each 
step, and (ii) the optimal feasible vector is reached after a number of iterations that 
is almost always no larger than of order M or N, whichever is larger. An interesting 
mathematical sidelight is that this second property, although known empirically ever 
since the simplex method was devised, was not proved to be true until the 1982 work 
of Stephen Smale. (For a contemporary account, see [3].) 


Simplex Method for a Restricted Normal Form 



A linear programming problem is said to be in normal form if it has no 
constraints in the form (10.8.3) or (10.8.4), but rather only equality constraints of the 
form (10.8.5) and nonnegativity constraints of the form (10.8.2). 
For our purposes it will be useful to consider an even more restricted set of cases, 
with this additional property: Each equality constraint of the form (10.8.5) must 
have at least one variable that has a positive coefficient and that appears uniquely in 
that one constraint only. We can then choose one such variable in each constraint 
equation, and solve that constraint equation for it. The variables thus chosen are 
called left-hand variables or basic variables, and there are exactly M (= $m_3$) of 
them. The remaining N − M variables are called right-hand variables or nonbasic 
variables. Obviously this restricted normal form can be achieved only in the case 
M ≤ N, so that is the case that we will consider. 

You may be thinking that our restricted normal form is so specialized that 
it is unlikely to include the linear programming problem that you wish to solve. 
Not at all! We will presently show how any linear programming problem can be 
transformed into restricted normal form. Therefore bear with us and learn how to 
apply the simplex method to a restricted normal form. 

Here is an example of a problem in restricted normal form: 

$$\text{Maximize} \quad z = 2x_2 - 4x_3 \tag{10.8.8}$$

with $x_1$, $x_2$, $x_3$, and $x_4$ all nonnegative and also with 

$$\begin{aligned} x_1 &= 2 - 6x_2 + x_3 \\ x_4 &= 8 + 3x_2 - 4x_3 \end{aligned} \tag{10.8.9}$$

This example has N = 4, M = 2; the left-hand variables are $x_1$ and $x_4$; the 
right-hand variables are $x_2$ and $x_3$. The objective function (10.8.8) is written so 
as to depend only on right-hand variables; note, however, that this is not an actual 
restriction on objective functions in restricted normal form, since any left-hand 
variables appearing in the objective function could be eliminated algebraically by 
use of (10.8.9) or its analogs. 

For any problem in restricted normal form, we can instantly read off a feasible 
basic vector (although not necessarily the optimal feasible basic vector). Simply set 
all right-hand variables equal to zero, and equation (10.8.9) then gives the values of 
the left-hand variables for which the constraints are satisfied. The idea of the simplex 
method is to proceed by a series of exchanges. In each exchange, a right-hand 
variable and a left-hand variable change places. At each stage we maintain a problem 
in restricted normal form that is equivalent to the original problem. 

It is notationally convenient to record the information content of equations 
(10.8.8) and (10.8.9) in a so-called tableau, as follows: 

                 x_2    x_3
      z      0     2     -4
      x_1    2    -6      1
      x_4    8     3     -4                (10.8.10)

You should study (10.8.10) to be sure that you understand where each entry comes 
from, and how to translate back and forth between the tableau and equation formats 
of a problem in restricted normal form. 



The first step in the simplex method is to examine the top row of the tableau, 
which we will call the “z-row.” Look at the entries in columns labeled by right-hand 
variables (we will call these “right-columns”). We want to imagine in turn the effect 
of increasing each right-hand variable from its present value of zero, while leaving 
all the other right-hand variables at zero. Will the objective function increase or 
decrease? The answer is given by the sign of the entry in the z-row. Since we want 
to increase the objective function, only right columns having positive z-row entries 
are of interest. In (10.8.10) there is only one such column, whose z-row entry is 2. 

The second step is to examine the column entries below each z-row entry that 
was selected by step one. We want to ask how much we can increase the right-hand 
variable before one of the left-hand variables is driven negative, which is not allowed. 
If the tableau element at the intersection of the right-hand column and the left-hand 
variable’s row is positive, then it poses no restriction: the corresponding left-hand 
variable will just be driven more and more positive. If all the entries in any right-hand 
column are positive, then there is no bound on the objective function and (having 
said so) we are done with the problem. 

If one or more entries below a positive z-row entry are negative, then we have 
to figure out which such entry first limits the increase of that column’s right-hand 
variable. Evidently the limiting increase is given by dividing the element in the right- 
hand column (which is called the pivot element) into the element in the “constant 
column” (leftmost column) of the pivot element’s row. A value that is small in 
magnitude is most restrictive. The increase in the objective function for this choice 
of pivot element is then that value multiplied by the z-row entry of that column. We 
repeat this procedure on all possible right-hand columns to find the pivot element 
with the largest such increase. That completes our “choice of a pivot element.” 
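
The first two steps can be sketched as a single routine. This is only an illustration of the rule as described in the text, not the simp1 and simp2 routines of this section. We assume a tableau stored by rows in a flat array, with row 0 the z-row and column 0 the constant column; the name choose_pivot and its return codes are our own. We also assume that a zero entry, like a positive one, poses no restriction.

```c
#include <assert.h>

/* Choose a pivot in a restricted-normal-form tableau t[nrows*ncols],
   stored by rows; row 0 is the z-row and column 0 the constant column.
   Returns  1 and sets *pr, *pc, *gain if a pivot was found,
            0 if no z-row entry is positive (no further increase possible),
           -1 if a column with positive z-row entry has no negative entry
              below it (objective function unbounded). */
int choose_pivot(const double *t, int nrows, int ncols,
                 int *pr, int *pc, double *gain)
{
    double best = -1.0;
    int r, c, found = 0;

    for (c = 1; c < ncols; c++) {
        double zc = t[c], limit = 0.0;
        int rbest = -1;

        if (zc <= 0.0) continue;          /* step one: positive z-row entries only */
        for (r = 1; r < nrows; r++) {     /* step two: most restrictive row */
            double e = t[r * ncols + c];
            if (e < 0.0) {
                double lim = t[r * ncols] / -e;   /* constant / |entry| */
                if (rbest < 0 || lim < limit) { limit = lim; rbest = r; }
            }
        }
        if (rbest < 0) return -1;         /* nothing limits this variable */
        if (zc * limit > best) {          /* keep the largest increase in z */
            best = zc * limit;
            *pr = rbest;
            *pc = c;
            found = 1;
        }
    }
    if (found) *gain = best;
    return found;
}
```

Applied to tableau (10.8.10), the routine selects the entry −6 (row of x_1, column of x_2) and reports an increase of 2/3 in the objective function.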

In the above example, the only positive z-row entry is 2. There is only one 
negative entry below it, namely −6, so this is the pivot element. Its constant-column 
entry is 2. This pivot will therefore allow $x_2$ to be increased by 2 ÷ |6|, which results 
in an increase of the objective function by an amount (2 × 2) ÷ |6|. 

The third step is to do the increase of the selected right-hand variable, thus 
making it a left-hand variable; and simultaneously to modify the left-hand variables, 
reducing the pivot-row element to zero and thus making it a right-hand variable. For 
our above example let’s do this first by hand: We begin by solving the pivot-row 
equation for the new left-hand variable $x_2$ in favor of the old one $x_1$, namely 

$$x_1 = 2 - 6x_2 + x_3 \quad \rightarrow \quad x_2 = \tfrac{1}{3} - \tfrac{1}{6}x_1 + \tfrac{1}{6}x_3 \tag{10.8.11}$$

We then substitute this into the old z-row, 

$$z = 2x_2 - 4x_3 = 2\left[\tfrac{1}{3} - \tfrac{1}{6}x_1 + \tfrac{1}{6}x_3\right] - 4x_3 = \tfrac{2}{3} - \tfrac{1}{3}x_1 - \tfrac{11}{3}x_3 \tag{10.8.12}$$

and into all other left-variable rows, in this case only $x_4$, 

$$x_4 = 8 + 3\left[\tfrac{1}{3} - \tfrac{1}{6}x_1 + \tfrac{1}{6}x_3\right] - 4x_3 = 9 - \tfrac{1}{2}x_1 - \tfrac{7}{2}x_3 \tag{10.8.13}$$

Equations (10.8.11)–(10.8.13) form the new tableau 

                 x_1     x_3
      z     2/3  -1/3   -11/3
      x_2   1/3  -1/6     1/6
      x_4    9   -1/2    -7/2              (10.8.14)


The fourth step is to go back and repeat the first step, looking for another possible 
increase of the objective function. We do this as many times as possible, that is, until 
all the right-hand entries in the z-row are negative, signaling that no further increase 
is possible. In the present example, this already occurs in (10.8.14), so we are done. 

The answer can now be read from the constant column of the final tableau. In 
(10.8.14) we see that the objective function is maximized to a value of 2/3 for the 
solution vector $x_2 = 1/3$, $x_4 = 9$, $x_1 = x_3 = 0$. 

Now look back over the procedure that led from (10.8.10) to (10.8.14). You will 
find that it could be summarized entirely in tableau format as a series of prescribed 
elementary matrix operations: 

• Locate the pivot element and save it. 

• Save the whole pivot column. 

• Replace each row, except the pivot row, by that linear combination of itself 
and the pivot row which makes its pivot-column entry zero. 

• Divide the pivot row by the negative of the pivot. 

• Replace the pivot element by the reciprocal of its saved value. 

• Replace the rest of the pivot column by its saved values divided by the 
saved pivot element. 

This is the sequence of operations actually performed by a linear programming 
routine, such as the one that we will presently give. 
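
The operations listed above can be sketched in a few lines. This is an illustration only, not the simp3 routine of this section; the flat row-major storage and the name tableau_pivot are our assumptions. Relabeling the exchanged variables is left to the caller.

```c
#include <assert.h>

/* Exchange the left-hand variable of row pr with the right-hand variable
   of column pc, in a tableau t[nrows*ncols] stored by rows. */
void tableau_pivot(double *t, int nrows, int ncols, int pr, int pc)
{
    double piv = t[pr * ncols + pc];      /* locate the pivot and save it */
    int r, c;

    assert(piv != 0.0);
    for (r = 0; r < nrows; r++) {
        double factor;
        if (r == pr) continue;
        /* saved pivot-column entry divided by the saved pivot */
        factor = t[r * ncols + pc] / piv;
        /* linear combination that zeroes this row's pivot-column entry */
        for (c = 0; c < ncols; c++)
            t[r * ncols + c] -= factor * t[pr * ncols + c];
        /* rest of pivot column: saved value / saved pivot */
        t[r * ncols + pc] = factor;
    }
    for (c = 0; c < ncols; c++)           /* divide pivot row by -pivot */
        t[pr * ncols + c] /= -piv;
    t[pr * ncols + pc] = 1.0 / piv;       /* pivot becomes its reciprocal */
}
```

Calling tableau_pivot on (10.8.10) with the pivot at the row of x_1 and the column of x_2 reproduces the coefficients of equations (10.8.11)–(10.8.13): constants 2/3, 1/3, 9 and coefficient columns (−1/3, −1/6, −1/2) and (−11/3, 1/6, −7/2).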

You should now be able to solve almost any linear programming problem that 
starts in restricted normal form. The only special case that might stump you is 
if an entry in the constant column turns out to be zero at some stage, so that a 
left-hand variable is zero at the same time as all the right-hand variables are zero. 
This is called a degenerate feasible vector. To proceed, you may need to exchange 
the degenerate left-hand variable for one of the right-hand variables, perhaps even 
making several such exchanges. 

Writing the General Problem in Restricted Normal Form 

Here is a pleasant surprise. There exist a couple of clever tricks that render 
trivial the task of translating a general linear programming problem into restricted 
normal form! 

First, we need to get rid of the inequalities of the form (10.8.3) or (10.8.4), for 
example, the first three constraints in (10.8.7). We do this by adding to the problem 
so-called slack variables which, when their nonnegativity is required, convert the 
inequalities to equalities. We will denote slack variables as $y_i$. There will be 
$m_1 + m_2$ of them. Once they are introduced, you treat them on an equal footing 
with the original variables $x_i$; then, at the very end, you simply ignore them. 

For example, introducing slack variables leaves (10.8.6) unchanged but turns 
(10.8.7) into 

$$\begin{aligned} x_1 + 2x_3 + y_1 &= 740 \\ 2x_2 - 7x_4 + y_2 &= 0 \\ x_2 - x_3 + 2x_4 - y_3 &= \tfrac{1}{2} \\ x_1 + x_2 + x_3 + x_4 &= 9 \end{aligned} \tag{10.8.15}$$



(Notice how the sign of the coefficient of the slack variable is determined by which 
sense of inequality it is replacing.) 
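
As a small sketch of this bookkeeping, the routine below extends one constraint row with slack-variable coefficients: +1 for a ≤ constraint, −1 for a ≥ constraint, and nothing for an equality. The enum and the function name are hypothetical, introduced only for illustration. The simplx routine of this section does not store these columns explicitly.

```c
#include <assert.h>

enum sense { LE, GE, EQ };   /* constraint is <=, >=, or = (hypothetical names) */

/* Copy the n coefficients of one constraint into out[0..n-1] and append
   nslack slack coefficients in out[n..n+nslack-1]. The constraint owns
   slack number slack_index (0-based); slack_index is ignored for EQ. */
void add_slack(const double *coeff, int n, enum sense s,
               int slack_index, int nslack, double *out)
{
    int k;

    assert(s == EQ || (slack_index >= 0 && slack_index < nslack));
    for (k = 0; k < n; k++) out[k] = coeff[k];
    for (k = 0; k < nslack; k++) out[n + k] = 0.0;
    if (s == LE) out[n + slack_index] = 1.0;        /* ... + y = b */
    else if (s == GE) out[n + slack_index] = -1.0;  /* ... - y = b */
}
```

For the third constraint of (10.8.7), with coefficients (0, 1, −1, 2) and sense ≥, the extended row is (0, 1, −1, 2, 0, 0, −1): the slack y_3 enters with a minus sign.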

Second, we need to ensure that there is a set of M left-hand vectors, so that we 
can set up a starting tableau in restricted normal form. (In other words, we need to 
find a “feasible basic starting vector.”) The trick is again to invent new variables! 
There are M of these, and they are called artificial variables; we denote them by $z_i$. 
You put exactly one artificial variable into each constraint equation on the following 
model for the example (10.8.15): 

$$\begin{aligned} z_1 &= 740 - x_1 - 2x_3 - y_1 \\ z_2 &= -2x_2 + 7x_4 - y_2 \\ z_3 &= \tfrac{1}{2} - x_2 + x_3 - 2x_4 + y_3 \\ z_4 &= 9 - x_1 - x_2 - x_3 - x_4 \end{aligned} \tag{10.8.16}$$


Our example is now in restricted normal form. 

Now you may object that (10.8.16) is not the same problem as (10.8.15) or 
(10.8.7) unless all the $z_i$’s are zero. Right you are! There is some subtlety here! 
We must proceed to solve our problem in two phases. First phase: We replace our 
objective function (10.8.6) by a so-called auxiliary objective function 

$$z' \equiv -z_1 - z_2 - z_3 - z_4 = -\left(749\tfrac{1}{2} - 2x_1 - 4x_2 - 2x_3 + 4x_4 - y_1 - y_2 + y_3\right) \tag{10.8.17}$$

(where the last equality follows from using 10.8.16). We now perform the simplex 
method on the auxiliary objective function (10.8.17) with the constraints (10.8.16). 
Obviously the auxiliary objective function will be maximized for nonnegative $z_i$’s if 
all the $z_i$’s are zero. We therefore expect the simplex method in this first phase to 
produce a set of left-hand variables drawn from the $x_i$’s and $y_i$’s only, with all the 
$z_i$’s being right-hand variables. Aha! We then cross out the $z_i$’s, leaving a problem 
involving only $x_i$’s and $y_i$’s in restricted normal form. In other words, the first phase 
produces an initial feasible basic vector. Second phase: Solve the problem produced 
by the first phase, using the original objective function, not the auxiliary. 

And what if the first phase doesn’t produce zero values for all the $z_i$’s? That 
signals that there is no initial feasible basic vector, i.e., that the constraints given to 
us are inconsistent among themselves. Report that fact, and you are done. 

Here is how to translate into tableau format the information needed for both the 
first and second phases of the overall method. As before, the underlying problem 
to be solved is as posed in equations (10.8.6)–(10.8.7). 



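                     x_1    x_2    x_3    x_4   ||   y_1    y_2    y_3
      ++=========================================++
 z    ||      0       1      1      3    -1/2   ||    0      0      0
 z_1  ||    740      -1      0     -2      0    ||   -1      0      0
 z_2  ||      0       0     -2      0      7    ||    0     -1      0
 z_3  ||    1/2       0     -1      1     -2    ||    0      0      1
 z_4  ||      9      -1     -1     -1     -1    ||    0      0      0
      ++=========================================++
 z'      -749 1/2     2      4      2     -4          1      1     -1

                                                               (10.8.18)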


This is not as daunting as it may, at first sight, appear. The table entries inside 
the box of double lines are no more than the coefficients of the original problem 
(10.8.6)–(10.8.7) organized into a tabular form. In fact, these entries, along with 
the values of N, M, $m_1$, $m_2$, and $m_3$, are the only input that is needed by the 
simplex method routine below. The columns under the slack variables $y_i$ simply 
record whether each of the M constraints is of the form ≤, ≥, or =; this is redundant 
information with the values $m_1$, $m_2$, $m_3$, as long as we are sure to enter the rows of 
the tableau in the correct respective order. The coefficients of the auxiliary objective 
function (bottom row) are just the negatives of the column sums of the rows above, 
so these are easily calculated automatically. 
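
Here is a minimal sketch of that automatic calculation, following the description of tableau (10.8.18): each entry of the bottom row is minus the sum of the corresponding column over the artificial-variable rows. The flat row-major storage and the name auxiliary_row are our assumptions. Note that the simplx routine below does its own version of this sum over only part of the rows, since it handles slack variables implicitly.

```c
#include <assert.h>

/* rows[r*ncols + c], r = 0..nrows-1, holds the rows of the artificial
   variables z_1..z_M (constant column first). On return out[c] is the
   corresponding entry of the auxiliary objective row z'. */
void auxiliary_row(const double *rows, int nrows, int ncols, double *out)
{
    int r, c;

    assert(nrows > 0 && ncols > 0);
    for (c = 0; c < ncols; c++) {
        double sum = 0.0;
        for (r = 0; r < nrows; r++) sum += rows[r * ncols + c];
        out[c] = -sum;                /* negative of the column sum */
    }
}
```

Fed the four rows z_1 through z_4 of (10.8.18), the routine returns −749½, 2, 4, 2, −4, 1, 1, −1, in agreement with (10.8.17).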

The output from a simplex routine will be (i) a flag telling whether a finite 
solution, no solution, or an unbounded solution was found, and (ii) an updated tableau. 
The output tableau that derives from (10.8.18), given to two significant figures, is 

                      x_1     y_2     y_3    ...
      z      17.03   -.95    -.05   -1.05    ...
      x_2     3.33   -.35    -.15     .35    ...
      x_3     4.73   -.55     .05    -.45    ...
      x_4      .95   -.10     .10     .10    ...
      y_1   730.55    .10    -.10     .90    ...

                                                   (10.8.19)

A little counting of the $x_i$’s and $y_i$’s will convince you that there are M + 1 
rows (including the z-row) in both the input and the output tableaux, but that only 
N + 1 − $m_3$ columns of the output tableau (including the constant column) contain 
any useful information, the other columns belonging to now-discarded artificial 
variables. In the output, the first numerical column contains the solution vector, 
along with the maximum value of the objective function. Where a slack variable ($y_i$) 
appears on the left, the corresponding value is the amount by which its inequality 
is safely satisfied. Variables that are not left-hand variables in the output tableau 
have zero values. Slack variables with zero values represent constraints that are 
satisfied as equalities. 
Routine Implementing the Simplex Method 

The following routine is based algorithmically on the implementation of Kuenzi, 
Tzschach, and Zehnder [4]. Aside from input values of M, N, $m_1$, $m_2$, $m_3$, the 
principal input to the routine is a two-dimensional array a containing the portion of 
the tableau (10.8.18) that is contained between the double lines. This input occupies 
the M + 1 rows and N + 1 columns of a[1..m+1][1..n+1]. Note, however, that 
reference is made internally to row M + 2 of a (used for the auxiliary objective 
function, just as in 10.8.18). Therefore the variable declared as float **a, must 
point to allocated memory allowing references in the subrange 

a[i][k],    i = 1...m+2,    k = 1...n+1                (10.8.20) 

You will suffer endless agonies if you fail to understand this simple point. Also do 
not neglect to order the rows of a in the same order as equations (10.8.1), (10.8.3), 
(10.8.4), and (10.8.5), that is, objective function, ≤-constraints, ≥-constraints, 
=-constraints. 

On output, the tableau a is indexed by two returned arrays of integers. iposv[j] 
contains, for j = 1...M, the number i whose original variable $x_i$ is now represented 
by row j+1 of a. These are thus the left-hand variables in the solution. (The first row 
of a is of course the z-row.) A value i > N indicates that the variable is a $y_i$ rather 
than an $x_i$, $x_{N+j} \equiv y_j$. Likewise, izrov[j] contains, for j = 1...N, the number i 
whose original variable $x_i$ is now a right-hand variable, represented by column j+1 
of a. These variables are all zero in the solution. The meaning of i > N is the same 
as above, except that i > N + $m_1$ + $m_2$ denotes an artificial or slack variable which 
was used only internally and should now be entirely ignored. 
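
To see how these index arrays are meant to be used, here is a minimal sketch that recovers the solution vector from the constant column and the left-hand variable numbers. It uses plain 0-based C arrays rather than the unit-offset float **a of the routine, and the name read_solution is hypothetical. The sample input suggested afterward is transcribed by hand from tableau (10.8.19), not produced by running simplx.

```c
#include <assert.h>

/* const_col[j] is the constant-column entry of the (j+1)-th left-hand
   variable's row, and iposv[j] is that variable's number in the 1-based
   numbering of the text (1..n are x's, larger values are slack variables).
   Fills x[0..n-1] with the solution vector. */
void read_solution(const double *const_col, const int *iposv,
                   int m, int n, double *x)
{
    int i, j;

    assert(m > 0 && n > 0);
    for (i = 0; i < n; i++) x[i] = 0.0;   /* right-hand variables are zero */
    for (j = 0; j < m; j++) {
        int v = iposv[j];
        if (v >= 1 && v <= n) x[v - 1] = const_col[j];
        /* v > n is a slack variable y_{v-n}: not a component of x */
    }
}
```

With const_col = (3.33, 4.73, 0.95, 730.55) and iposv = (2, 3, 4, 5), where 5 = N + 1 denotes y_1, the result is x = (0, 3.33, 4.73, 0.95).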

The flag icase is set to zero if a finite solution is found, +1 if the objective 
function is unbounded, −1 if no solution satisfies the given constraints. 

The routine treats the case of degenerate feasible vectors, so don’t worry about 
them. You may also wish to admire the fact that the routine does not require storage 
for the columns of the tableau (10.8.18) that are to the right of the double line; it 
keeps track of slack variables by more efficient bookkeeping. 

Please note that, as given, the routine is only “semi-sophisticated” in its tests 
for convergence. While the routine properly implements tests for inequality with 
zero as tests against some small parameter EPS, it does not adjust this parameter to 
reflect the scale of the input data. This is adequate for many problems, where the 
input data do not differ from unity by too many orders of magnitude. If, however, 
you encounter endless cycling, then you should modify EPS in the routines simplx 
and simp2. Permuting your variables can also help. Finally, consult [5]. 

#include "nrutil.h"
#define EPS 1.0e-6
/* Here EPS is the absolute precision, which should be adjusted to the scale of your variables. */
#define FREEALL free_ivector(l3,1,m);free_ivector(l1,1,n+1);

void simplx(float **a, int m, int n, int m1, int m2, int m3, int *icase,
    int izrov[], int iposv[])
/* Simplex method for linear programming. Input parameters a, m, n, m1, m2, and m3,
and output parameters a, icase, izrov, and iposv are described above. */
{
    void simp1(float **a, int mm, int ll[], int nll, int iabf, int *kp,
        float *bmax);
    void simp2(float **a, int m, int n, int *ip, int kp);
    void simp3(float **a, int i1, int k1, int ip, int kp);
    int i,ip,is,k,kh,kp,nl1;
    int *l1,*l3;
    float q1,bmax;

    if (m != (m1+m2+m3)) nrerror("Bad input constraint counts in simplx");
    l1=ivector(1,n+1);
    l3=ivector(1,m);
    nl1=n;
    for (k=1;k<=n;k++) l1[k]=izrov[k]=k;
    /* Initialize index list of columns admissible for exchange, and make all variables initially
    right-hand. */
    for (i=1;i<=m;i++) {
        if (a[i+1][1] < 0.0) nrerror("Bad input tableau in simplx");
        /* Constants b_i must be nonnegative. */
        iposv[i]=n+i;
        /* Initial left-hand variables. m1 type constraints are represented by having their slack
        variable initially left-hand, with no artificial variable. m2 type constraints have their
        slack variable initially left-hand, with a minus sign, and their artificial variable handled
        implicitly during their first exchange. m3 type constraints have their artificial variable
        initially left-hand. */
    }
    if (m2+m3) {       /* Origin is not a feasible starting solution: we must do phase one. */
        for (i=1;i<=m2;i++) l3[i]=1;
        /* Initialize list of m2 constraints whose slack variables have never been exchanged out
        of the initial basis. */
        for (k=1;k<=(n+1);k++) {          /* Compute the auxiliary objective function. */
            q1=0.0;
            for (i=m1+1;i<=m;i++) q1 += a[i+1][k];
            a[m+2][k] = -q1;
        }
        for (;;) {
            simp1(a,m+1,l1,nl1,0,&kp,&bmax);   /* Find max. coeff. of auxiliary objective fn. */
            if (bmax <= EPS && a[m+2][1] < -EPS) {
                *icase = -1;
                /* Auxiliary objective function is still negative and can't be improved, hence no
                feasible solution exists. */
                FREEALL return;
            } else if (bmax <= EPS && a[m+2][1] <= EPS) {
                /* Auxiliary objective function is zero and can't be improved; we have a feasible
                starting vector. Clean out the artificial variables corresponding to any remaining
                equality constraints by goto one and then move on to phase two. */
                for (ip=m1+m2+1;ip<=m;ip++) {

if (iposv[ip] == (ip+n)) { Found an artificial variable for an 
simpl(a,ip,ll,nll,l,&kp,&bmax) ; equality constraint, 

if (bmax > EPS) Exchange with column correspond- 

goto one; ing to maximum pivot element 

> in row. 

> 


for (i=ml+l;i<=ml+m2;i++) 
if (13[i-ml] == 1) 

for (k=l;k<=n+l;k++) 
a[i+l] [k] = -a[i+l] 

break; 

> 

simp2(a,m,n,&ip,kp); 
if (ip == 0) { 

*icase = -1; 

FREEALL return; 


Change sign of row for any m2 con¬ 
straints still present from the ini¬ 
tial basis. 

Go to phase two. 

Locate a pivot element (phase one). 

Maximum of auxiliary objective func¬ 
tion is unbounded, so no feasi¬ 
ble solution exists. 


> 

simp3(a,m+l,n,ip,kp); 

Exchange a left- and a right-hand variable (phase one), then update lists. 





one: 
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> 


if (iposv[ip] >= (n+ml+m2+l)) { Exchanged out an artificial variable 

for (k=l;k<=nll;k++) for an equality constraint. Make 

if (11 [k] == kp) break; sure it stays out by removing it 

—nil; from the 11 list, 

for (is=k;is<=nll;is++) 11[is]=ll[is+1]; 

> else { 

kh=iposv[ip]-ml-n; 
if (kh >= 1 && 13[kh]) { 

13 [kh] =0; 

++a [m+2] [kp+1] ; 
for (i=l;i<=m+2;i++) 

a [i] [kp+1] = -a[i] [kp+1] ; 

> 


Exchanged out an m2 type constraint 
for the first time. Correct the 
pivot column for the minus sign 
and the implicit artificial vari¬ 
able. 


> 

is=izrov[kp] ; Update lists of left- and right-hand 

izrov [kp] =iposv [ip] ; variables, 

iposv[ip]=is; 

> Still in phase one, go back to the 

> for(;;). 

End of phase one code for finding an initial feasible solution. Now, in phase two, optimize 


for (;;) { 

simpl(a,0,11,nll ,0 ,fckp,&bmax); 
if (bmax <= EPS) { 

*icase=0; 

FREEALL return; 

> 

simp2(a,m,n,&ip,kp); 
if (ip == 0) { 

*icase=l; 

FREEALL return; 

} 

simp3(a,m,n,ip,kp); 
is=izrov[kp]; 
izrov[kp]=iposv [ip]; 
iposv [ip]=is; 


Test the z-row for doneness. 

Done. Solution found. Return with 
the good news. 


Locate a pivot element (phase two). 
Objective function is unbounded. Re¬ 
port and return. 


Exchange a left- and a right-hand 
variable (phase two), 

and return for another iteration. 


The preceding routine makes use of the following utility functions. 


#include <math.h>

void simp1(float **a, int mm, int ll[], int nll, int iabf, int *kp,
    float *bmax)
/* Determines the maximum of those elements whose index is contained in the supplied list ll,
   either with or without taking the absolute value, as flagged by iabf. */
{
    int k;
    float test;

    if (nll <= 0)       /* No eligible columns. */
        *bmax=0.0;
    else {
        *kp=ll[1];
        *bmax=a[mm+1][*kp+1];
        for (k=2;k<=nll;k++) {
            if (iabf == 0)
                test=a[mm+1][ll[k]+1]-(*bmax);
            else
                test=fabs(a[mm+1][ll[k]+1])-fabs(*bmax);
            if (test > 0.0) {
                *bmax=a[mm+1][ll[k]+1];
                *kp=ll[k];
            }
        }
    }
}


#define EPS 1.0e-6

void simp2(float **a, int m, int n, int *ip, int kp)
/* Locate a pivot element, taking degeneracy into account. */
{
    int k,i;
    float qp,q0,q,q1;

    *ip=0;
    for (i=1;i<=m;i++)
        if (a[i+1][kp+1] < -EPS) break;     /* Any possible pivots? */
    if (i>m) return;
    q1 = -a[i+1][1]/a[i+1][kp+1];
    *ip=i;
    for (i=*ip+1;i<=m;i++) {
        if (a[i+1][kp+1] < -EPS) {
            q = -a[i+1][1]/a[i+1][kp+1];
            if (q < q1) {
                *ip=i;
                q1=q;
            } else if (q == q1) {           /* We have a degeneracy. */
                for (k=1;k<=n;k++) {
                    qp = -a[*ip+1][k+1]/a[*ip+1][kp+1];
                    q0 = -a[i+1][k+1]/a[i+1][kp+1];
                    if (q0 != qp) break;
                }
                if (q0 < qp) *ip=i;
            }
        }
    }
}


void simp3(float **a, int i1, int k1, int ip, int kp)
/* Matrix operations to exchange a left-hand and right-hand variable (see text). */
{
    int kk,ii;
    float piv;

    piv=1.0/a[ip+1][kp+1];
    for (ii=1;ii<=i1+1;ii++)
        if (ii-1 != ip) {
            a[ii][kp+1] *= piv;
            for (kk=1;kk<=k1+1;kk++)
                if (kk-1 != kp)
                    a[ii][kk] -= a[ip+1][kk]*a[ii][kp+1];
        }
    for (kk=1;kk<=k1+1;kk++)
        if (kk-1 != kp) a[ip+1][kk] *= -piv;
    a[ip+1][kp+1]=piv;
}


Other Topics Briefly Mentioned 

Every linear programming problem in normal form with N variables and M 
constraints has a corresponding dual problem with M variables and N constraints. 
The tableau of the dual problem is, in essence, the transpose of the tableau of the 
original (sometimes called primal) problem. It is possible to go from a solution 
of the dual to a solution of the primal. This can occasionally be computationally 
useful, but generally it is no big deal. 

The revised simplex method is exactly equivalent to the simplex method in its 
choice of which left-hand and right-hand variables are exchanged. Its computational 
effort is not significantly less than that of the simplex method. It does differ in 
the organization of its storage, requiring only a matrix of size M × M, rather than 
M × N, in its intermediate stages. If you have a lot of constraints, and memory 
size is one of them, then you should look into it. 

The primal-dual algorithm and the composite simplex algorithm are two different 
methods for avoiding the two phases of the usual simplex method: Progress 
is made simultaneously towards finding a feasible solution and finding an optimal 
solution. There seems to be no clearcut evidence that these methods are superior 
to the usual method by any factor substantially larger than the “tender-loving-care 
factor” (which reflects the programming effort of the proponents). 

Problems where the objective function and/or one or more of the constraints are 
replaced by expressions nonlinear in the variables are called nonlinear programming 
problems. The literature on such problems is vast, but outside our scope. The special 
case of quadratic expressions is called quadratic programming. Optimization problems 
where the variables take on only integer values are called integer programming 
problems, a special case of discrete optimization generally. The next section looks 
at a particular kind of discrete optimization problem. 

10.9 Simulated Annealing Methods 


The method of simulated annealing [1,2] is a technique that has attracted significant 
attention as suitable for optimization problems of large scale, especially ones 
where a desired global extremum is hidden among many, poorer, local extrema. For 
practical purposes, simulated annealing has effectively “solved” the famous traveling 
salesman problem of finding the shortest cyclical itinerary for a traveling salesman 
who must visit each of N cities in turn. (Other practical methods have also been 
found.) The method has also been used successfully for designing complex integrated 
circuits: The arrangement of several hundred thousand circuit elements on a tiny 
silicon substrate is optimized so as to minimize interference among their connecting 
wires [3,4]. Surprisingly, the implementation of the algorithm is relatively simple. 

Notice that the two applications cited are both examples of combinatorial 
minimization. There is an objective function to be minimized, as usual; but the space 
over which that function is defined is not simply the N-dimensional space of N 
continuously variable parameters. Rather, it is a discrete, but very large, configuration 
space, like the set of possible orders of cities, or the set of possible allocations of 
silicon “real estate” blocks to circuit elements. The number of elements in the 
configuration space is factorially large, so that they cannot be explored exhaustively. 
Furthermore, since the set is discrete, we are deprived of any notion of “continuing 
downhill in a favorable direction.” The concept of “direction” may not have any 
meaning in the configuration space. 

Below, we will also discuss how to use simulated annealing methods for spaces 
with continuous control parameters, like those of §§10.4-10.7. This application is 
actually more complicated than the combinatorial one, since the familiar problem of 
“long, narrow valleys” again asserts itself. Simulated annealing, as we will see, tries 
“random” steps; but in a long, narrow valley, almost all random steps are uphill! 
Some additional finesse is therefore required. 

At the heart of the method of simulated annealing is an analogy with thermodynamics, 
specifically with the way that liquids freeze and crystallize, or metals cool 
and anneal. At high temperatures, the molecules of a liquid move freely with respect 
to one another. If the liquid is cooled slowly, thermal mobility is lost. The atoms are 
often able to line themselves up and form a pure crystal that is completely ordered 
over a distance up to billions of times the size of an individual atom in all directions. 
This crystal is the state of minimum energy for this system. The amazing fact is that, 
for slowly cooled systems, nature is able to find this minimum energy state. In fact, if 
a liquid metal is cooled quickly or “quenched,” it does not reach this state but rather 
ends up in a polycrystalline or amorphous state having somewhat higher energy. 

So the essence of the process is slow cooling, allowing ample time for 
redistribution of the atoms as they lose mobility. This is the technical definition of 
annealing, and it is essential for ensuring that a low energy state will be achieved. 

Although the analogy is not perfect, there is a sense in which all of the 
minimization algorithms thus far in this chapter correspond to rapid cooling or 
quenching. In all cases, we have gone greedily for the quick, nearby solution: From 
the starting point, go immediately downhill as far as you can go. This, as often 
remarked above, leads to a local, but not necessarily a global, minimum. Nature’s 
own minimization algorithm is based on quite a different procedure. The so-called 
Boltzmann probability distribution, 

$$ \mathrm{Prob}(E) \sim \exp(-E/kT) \tag{10.9.1} $$

expresses the idea that a system in thermal equilibrium at temperature T has its 
energy probabilistically distributed among all different energy states E. Even at 
low temperature, there is a chance, albeit very small, of a system being in a high 
energy state. Therefore, there is a corresponding chance for the system to get out of 
a local energy minimum in favor of finding a better, more global, one. The quantity 
k (Boltzmann’s constant) is a constant of nature that relates temperature to energy. 
In other words, the system sometimes goes uphill as well as downhill; but the lower 
the temperature, the less likely is any significant uphill excursion. 

In 1953, Metropolis and coworkers [5] first incorporated these kinds of principles 
into numerical calculations. Offered a succession of options, a simulated 
thermodynamic system was assumed to change its configuration from energy E_1 to 
energy E_2 with probability p = exp[−(E_2 − E_1)/kT]. Notice that if E_2 < E_1, this 
probability is greater than unity; in such cases the change is arbitrarily assigned a 
probability p = 1, i.e., the system always took such an option. This general scheme, 
of always taking a downhill step while sometimes taking an uphill step, has come 
to be known as the Metropolis algorithm. 

To make use of the Metropolis algorithm for other than thermodynamic systems, 
one must provide the following elements: 

1. A description of possible system configurations. 

2. A generator of random changes in the configuration; these changes are the 
“options” presented to the system. 

3. An objective function E (analog of energy) whose minimization is the 
goal of the procedure. 

4. A control parameter T (analog of temperature) and an annealing schedule 
which tells how it is lowered from high to low values, e.g., after how many random 
changes in configuration is each downward step in T taken, and how large is that 
step. The meaning of “high” and “low” in this context, and the assignment of a 
schedule, may require physical insight and/or trial-and-error experiments. 

Combinatorial Minimization: The Traveling Salesman 

A concrete illustration is provided by the traveling salesman problem. The 
proverbial seller visits N cities with given positions (x_i, y_i), returning finally to his or 
her city of origin. Each city is to be visited only once, and the route is to be made as 
short as possible. This problem belongs to a class known as NP-complete problems, 
whose computation time for an exact solution increases with N as exp(const. × N), 
becoming rapidly prohibitive in cost as N increases. The traveling salesman problem 
also belongs to a class of minimization problems for which the objective function E 
has many local minima. In practical cases, it is often enough to be able to choose 
from these a minimum which, even if not absolute, cannot be significantly improved 
upon. The annealing method manages to achieve this, while limiting its calculations 
to scale as a small power of N. 

As a problem in simulated annealing, the traveling salesman problem is handled 
as follows: 

1. Configuration. The cities are numbered i = 1...N and each has coordinates 
(x_i, y_i). A configuration is a permutation of the numbers 1...N, interpreted as the 
order in which the cities are visited. 

2. Rearrangements. An efficient set of moves has been suggested by Lin [6]. 
The moves consist of two types: (a) A section of path is removed and then replaced 
with the same cities running in the opposite order; or (b) a section of path is removed 
and then replaced in between two cities on another, randomly chosen, part of the path. 

3. Objective Function. In the simplest form of the problem, E is taken just 
as the total length of journey, 

$$ E = L \equiv \sum_{i=1}^{N} \sqrt{(x_i - x_{i+1})^2 + (y_i - y_{i+1})^2} \tag{10.9.2} $$

with the convention that point N + 1 is identified with point 1. To illustrate the 
flexibility of the method, however, we can add the following additional wrinkle: 
Suppose that the salesman has an irrational fear of flying over the Mississippi River. 
In that case, we would assign each city a parameter μ_i, equal to +1 if it is east of the 
Mississippi, −1 if it is west, and take the objective function to be 

$$ E = \sum_{i=1}^{N} \left[ \sqrt{(x_i - x_{i+1})^2 + (y_i - y_{i+1})^2} + \lambda (\mu_i - \mu_{i+1})^2 \right] \tag{10.9.3} $$



A penalty 4λ is thereby assigned to any river crossing. The algorithm now finds 
the shortest path that avoids crossings. The relative importance that it assigns to 
length of path versus river crossings is determined by our choice of λ. Figure 10.9.1 
shows the results obtained. Clearly, this technique can be generalized to include 
many conflicting goals in the minimization. 

4. Annealing schedule. This requires experimentation. We first generate some 
random rearrangements, and use them to determine the range of values of ΔE that 
will be encountered from move to move. Choosing a starting value for the parameter 
T which is considerably larger than the largest ΔE normally encountered, we 
proceed downward in multiplicative steps each amounting to a 10 percent decrease 
in T. We hold each new value of T constant for, say, 100N reconfigurations, or for 
10N successful reconfigurations, whichever comes first. When efforts to reduce E 
further become sufficiently discouraging, we stop. 

The following traveling salesman program, using the Metropolis algorithm, 
illustrates the main aspects of the simulated annealing technique for combinatorial 
problems. 
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define TFACTR 0.9      /* Annealing schedule: reduce t by this factor on each step. */
#define ALEN(a,b,c,d) sqrt(((b)-(a))*((b)-(a))+((d)-(c))*((d)-(c)))

void anneal(float x[], float y[], int iorder[], int ncity)
/* This algorithm finds the shortest round-trip path to ncity cities whose coordinates are in the
   arrays x[1..ncity],y[1..ncity]. The array iorder[1..ncity] specifies the order in
   which the cities are visited. On input, the elements of iorder may be set to any permutation
   of the numbers 1 to ncity. This routine will return the best alternative path it can find. */
{
    int irbit1(unsigned long *iseed);
    int metrop(float de, float t);
    float ran3(long *idum);
    float revcst(float x[], float y[], int iorder[], int ncity, int n[]);
    void reverse(int iorder[], int ncity, int n[]);
    float trncst(float x[], float y[], int iorder[], int ncity, int n[]);
    void trnspt(int iorder[], int ncity, int n[]);
    int ans,nover,nlimit,i1,i2;
    int i,j,k,nsucc,nn,idec;
    static int n[7];
    long idum;
    unsigned long iseed;
    float path,de,t;

    nover=100*ncity;    /* Maximum number of paths tried at any temperature. */
    nlimit=10*ncity;    /* Maximum number of successful path changes before continuing. */
    path=0.0;
    t=0.5;
    for (i=1;i<ncity;i++) {     /* Calculate initial path length. */
        i1=iorder[i];
        i2=iorder[i+1];
        path += ALEN(x[i1],x[i2],y[i1],y[i2]);
    }
    i1=iorder[ncity];           /* Close the loop by tying path ends together. */
    i2=iorder[1];
    path += ALEN(x[i1],x[i2],y[i1],y[i2]);
    idum = -1;
    iseed=111;
    for (j=1;j<=100;j++) {      /* Try up to 100 temperature steps. */
        nsucc=0;
        for (k=1;k<=nover;k++) {
            do {
                n[1]=1+(int) (ncity*ran3(&idum));       /* Choose beginning of segment */
                n[2]=1+(int) ((ncity-1)*ran3(&idum));   /* ..and.. end of segment. */
                if (n[2] >= n[1]) ++n[2];
                nn=1+((n[1]-n[2]+ncity-1) % ncity);     /* nn is the number of cities
                                                           not on the segment. */
            } while (nn<3);
            /* Decide whether to do a segment reversal or transport. */
            idec=irbit1(&iseed);
            if (idec == 0) {                            /* Do a transport. */
                n[3]=n[2]+(int) (abs(nn-2)*ran3(&idum))+1;
                n[3]=1+((n[3]-1) % ncity);
                /* Transport to a location not on the path. */
                de=trncst(x,y,iorder,ncity,n);          /* Calculate cost. */
                ans=metrop(de,t);                       /* Consult the oracle. */
                if (ans) {
                    ++nsucc;
                    path += de;
                    trnspt(iorder,ncity,n);             /* Carry out the transport. */
                }
            } else {                                    /* Do a path reversal. */
                de=revcst(x,y,iorder,ncity,n);          /* Calculate cost. */
                ans=metrop(de,t);                       /* Consult the oracle. */
                if (ans) {
                    ++nsucc;
                    path += de;
                    reverse(iorder,ncity,n);            /* Carry out the reversal. */
                }
            }
            if (nsucc >= nlimit) break;     /* Finish early if we have enough successful changes. */
        }
        printf("\n %s %10.6f %s %12.6f \n","T =",t,
            "    Path Length =",path);
        printf("Successful Moves: %6d\n",nsucc);
        t *= TFACTR;                        /* Annealing schedule. */
        if (nsucc == 0) return;             /* If no success, we are done. */
    }
}

#include <math.h>
#define ALEN(a,b,c,d) sqrt(((b)-(a))*((b)-(a))+((d)-(c))*((d)-(c)))

float revcst(float x[], float y[], int iorder[], int ncity, int n[])
/* This function returns the value of the cost function for a proposed path reversal. ncity is the
   number of cities, and arrays x[1..ncity],y[1..ncity] give the coordinates of these cities.
   iorder[1..ncity] holds the present itinerary. The first two values n[1] and n[2] of array
   n give the starting and ending cities along the path segment which is to be reversed. On output,
   de is the cost of making the reversal. The actual reversal is not performed by this routine. */
{
    float xx[5],yy[5],de;
    int j,ii;

    n[3]=1 + ((n[1]+ncity-2) % ncity);      /* Find the city before n[1] .. */
    n[4]=1 + (n[2] % ncity);                /* .. and the city after n[2]. */
    for (j=1;j<=4;j++) {
        ii=iorder[n[j]];                    /* Find coordinates for the four cities involved. */
        xx[j]=x[ii];
        yy[j]=y[ii];
    }
    /* Calculate cost of disconnecting the segment at both ends and reconnecting
       in the opposite order. */
    de = -ALEN(xx[1],xx[3],yy[1],yy[3]);
    de -= ALEN(xx[2],xx[4],yy[2],yy[4]);
    de += ALEN(xx[1],xx[4],yy[1],yy[4]);
    de += ALEN(xx[2],xx[3],yy[2],yy[3]);
    return de;
}


void reverse(int iorder[], int ncity, int n[])
/* This routine performs a path segment reversal. iorder[1..ncity] is an input array giving the
   present itinerary. The vector n has as its first four elements the first and last cities n[1],n[2]
   of the path segment to be reversed, and the two cities n[3] and n[4] that immediately
   precede and follow this segment. n[3] and n[4] are found by function revcst. On output,
   iorder[1..ncity] contains the segment from n[1] to n[2] in reversed order. */
{
    int nn,j,k,l,itmp;

    /* This many cities must be swapped to effect the reversal. */
    nn=(1+((n[2]-n[1]+ncity) % ncity))/2;
    for (j=1;j<=nn;j++) {
        /* Start at the ends of the segment and swap pairs of cities, moving
           toward the center. */
        k=1 + ((n[1]+j-2) % ncity);
        l=1 + ((n[2]-j+ncity) % ncity);
        itmp=iorder[k];
        iorder[k]=iorder[l];
        iorder[l]=itmp;
    }
}




#include <math.h>
#define ALEN(a,b,c,d) sqrt(((b)-(a))*((b)-(a))+((d)-(c))*((d)-(c)))

float trncst(float x[], float y[], int iorder[], int ncity, int n[])
/* This routine returns the value of the cost function for a proposed path segment transport. ncity
   is the number of cities, and arrays x[1..ncity] and y[1..ncity] give the city coordinates.
   iorder[1..ncity] is an array giving the present itinerary. The first three elements of array
   n give the starting and ending cities of the path to be transported, and the point among the
   remaining cities after which it is to be inserted. On output, de is the cost of the change. The
   actual transport is not performed by this routine. */
{
    float xx[7],yy[7],de;
    int j,ii;

    n[4]=1 + (n[3] % ncity);                /* Find the city following n[3].. */
    n[5]=1 + ((n[1]+ncity-2) % ncity);      /* ..and the one preceding n[1].. */
    n[6]=1 + (n[2] % ncity);                /* ..and the one following n[2]. */
    for (j=1;j<=6;j++) {
        ii=iorder[n[j]];                    /* Determine coordinates for the six cities involved. */
        xx[j]=x[ii];
        yy[j]=y[ii];
    }
    /* Calculate the cost of disconnecting the path segment from n[1] to n[2],
       opening a space between n[3] and n[4], connecting the segment in the
       space, and connecting n[5] to n[6]. */
    de = -ALEN(xx[2],xx[6],yy[2],yy[6]);
    de -= ALEN(xx[1],xx[5],yy[1],yy[5]);
    de -= ALEN(xx[3],xx[4],yy[3],yy[4]);
    de += ALEN(xx[1],xx[3],yy[1],yy[3]);
    de += ALEN(xx[2],xx[4],yy[2],yy[4]);
    de += ALEN(xx[5],xx[6],yy[5],yy[6]);
    return de;
}


#include "nrutil.h"

void trnspt(int iorder[], int ncity, int n[])
/* This routine does the actual path transport, once metrop has approved. iorder[1..ncity]
   is an input array giving the present itinerary. The array n has as its six elements the beginning
   n[1] and end n[2] of the path to be transported, the adjacent cities n[3] and n[4] between
   which the path is to be placed, and the cities n[5] and n[6] that precede and follow the path.
   n[4], n[5], and n[6] are calculated by function trncst. On output, iorder is modified to
   reflect the movement of the path segment. */
{
    int m1,m2,m3,nn,j,jj,*jorder;

    jorder=ivector(1,ncity);
    m1=1 + ((n[2]-n[1]+ncity) % ncity);     /* Find number of cities from n[1] to n[2] */
    m2=1 + ((n[5]-n[4]+ncity) % ncity);     /* ...and the number from n[4] to n[5] */
    m3=1 + ((n[3]-n[6]+ncity) % ncity);     /* ...and the number from n[6] to n[3]. */
    nn=1;
    for (j=1;j<=m1;j++) {
        jj=1 + ((j+n[1]-2) % ncity);        /* Copy the chosen segment. */
        jorder[nn++]=iorder[jj];
    }
    for (j=1;j<=m2;j++) {                   /* Then copy the segment from n[4] to n[5]. */
        jj=1 + ((j+n[4]-2) % ncity);
        jorder[nn++]=iorder[jj];
    }
    for (j=1;j<=m3;j++) {                   /* Finally, the segment from n[6] to n[3]. */
        jj=1 + ((j+n[6]-2) % ncity);
        jorder[nn++]=iorder[jj];
    }
    for (j=1;j<=ncity;j++)                  /* Copy jorder back into iorder. */
        iorder[j]=jorder[j];
    free_ivector(jorder,1,ncity);
}


#include <math.h>

int metrop(float de, float t)
/* Metropolis algorithm. metrop returns a boolean variable that issues a verdict on whether
   to accept a reconfiguration that leads to a change de in the objective function e. If de<0,
   metrop = 1 (true), while if de>0, metrop is only true with probability exp(-de/t), where
   t is a temperature determined by the annealing schedule. */
{
    float ran3(long *idum);
    static long gljdum=1;

    return de < 0.0 || ran3(&gljdum) < exp(-de/t);
}


Continuous Minimization by Simulated Annealing 

The basic ideas of simulated annealing are also applicable to optimization 
problems with continuous N-dimensional control spaces, e.g., finding the (ideally, 
global) minimum of some function f(x), in the presence of many local minima, 
where x is an N-dimensional vector. The four elements required by the Metropolis 
procedure are now as follows: The value of f is the objective function. The 
system state is the point x. The control parameter T is, as before, something like a 
temperature, with an annealing schedule by which it is gradually reduced. And there 
must be a generator of random changes in the configuration, that is, a procedure for 
taking a random step from x to x + Δx. 

The last of these elements is the most problematical. The literature to date [7-10] 
describes several different schemes for choosing Δx, none of which, in our view, 
inspire complete confidence. The problem is one of efficiency: A generator of 
random changes is inefficient if, when local downhill moves exist, it nevertheless 
almost always proposes an uphill move. A good generator, we think, should not 
become inefficient in narrow valleys; nor should it become more and more inefficient 
as convergence to a minimum is approached. Except possibly for [7], all of the 
schemes that we have seen are inefficient in one or both of these situations. 

Our own way of doing simulated annealing minimization on continuous control 
spaces is to use a modification of the downhill simplex method (§ 10.4). This amounts 
to replacing the single point x as a description of the system state by a simplex of 
N + 1 points. The “moves” are the same as described in §10.4, namely reflections, 
expansions, and contractions of the simplex. The implementation of the Metropolis 
procedure is slightly subtle: We add a positive, logarithmically distributed random 
variable, proportional to the temperature T, to the stored function value associated 
with every vertex of the simplex, and we subtract a similar random variable from 
the function value of every new point that is tried as a replacement point. Like the 
ordinary Metropolis procedure, this method always accepts a true downhill step, but 
sometimes accepts an uphill one. In the limit T → 0, this algorithm reduces exactly 
to the downhill simplex method and converges to a local minimum. 

At a finite value of T, the simplex expands to a scale that approximates the size 
of the region that can be reached at this temperature, and then executes a stochastic, 
tumbling Brownian motion within that region, sampling new, approximately random, 
points as it does so. The efficiency with which a region is explored is independent 
of its narrowness (for an ellipsoidal valley, the ratio of its principal axes) and 
orientation. If the temperature is reduced sufficiently slowly, it becomes highly 
likely that the simplex will shrink into that region containing the lowest relative 
minimum encountered. 

As in all applications of simulated annealing, there can be quite a lot of 
problem-dependent subtlety in the phrase “sufficiently slowly”; success or failure 
is quite often determined by the choice of annealing schedule. Here are some 
possibilities worth trying: 

• Reduce T to (1 − ε)T after every m moves, where ε/m is determined 
by experiment. 

• Budget a total of K moves, and reduce T after every m moves to a value 
T = T_0(1 − k/K)^α, where k is the cumulative number of moves thus far, 
and α is a constant, say 1, 2, or 4. The optimal value for α depends on the 
statistical distribution of relative minima of various depths. Larger values 
of α spend more iterations at lower temperature. 

• After every m moves, set T to β times f_1 − f_b, where β is an experimentally 
determined constant of order 1, f_1 is the smallest function value currently 
represented in the simplex, and f_b is the best function value ever encountered. 
However, never reduce T by more than some fraction γ at a time. 

Another strategic question is whether to do an occasional restart, where a vertex 
of the simplex is discarded in favor of the “best-ever” point. (You must be sure that 
the best-ever point is not currently in the simplex when you do this!) We have found 
problems for which restarts — every time the temperature has decreased by a factor 
of 3, say — are highly beneficial; we have found other problems for which restarts 
have no positive, or a somewhat negative, effect. 

You should compare the following routine, amebsa, with its counterpart amoeba 
in §10.4. Note that the argument iter is used in a somewhat different manner. 

#include <math.h>
#include "nrutil.h"
#define GET_PSUM \
    for (n=1;n<=ndim;n++) {\
    for (sum=0.0,m=1;m<=mpts;m++) sum += p[m][n];\
    psum[n]=sum;}

extern long idum;    /* Defined and initialized in main. */
float tt;            /* Communicates with amotsa. */

void amebsa(float **p, float y[], int ndim, float pb[], float *yb, float ftol,
    float (*funk)(float []), int *iter, float temptr)
/* Multidimensional minimization of the function funk(x) where x[1..ndim] is a vector in
ndim dimensions, by simulated annealing combined with the downhill simplex method of Nelder
and Mead. The input matrix p[1..ndim+1][1..ndim] has ndim+1 rows, each an ndim-dimensional
vector which is a vertex of the starting simplex. Also input are the following: the
vector y[1..ndim+1], whose components must be pre-initialized to the values of funk
evaluated at the ndim+1 vertices (rows) of p; ftol, the fractional convergence tolerance to be
achieved in the function value for an early return; iter, and temptr. The routine makes iter
function evaluations at an annealing temperature temptr, then returns. You should then
decrease temptr according to your annealing schedule, reset iter, and call the routine again
(leaving other arguments unaltered between calls). If iter is returned with a positive value,
then early convergence and return occurred. If you initialize yb to a very large value on the first
call, then yb and pb[1..ndim] will subsequently return the best function value and point ever
encountered (even if it is no longer a point in the simplex). */
{
    float amotsa(float **p, float y[], float psum[], int ndim, float pb[],
        float *yb, float (*funk)(float []), int ihi, float *yhi, float fac);
    float ran1(long *idum);
    int i,ihi,ilo,j,m,n,mpts=ndim+1;
    float rtol,sum,swap,yhi,ylo,ynhi,ysave,yt,ytry,*psum;

    psum=vector(1,ndim);
    tt = -temptr;
    GET_PSUM
    for (;;) {
        ilo=1;    /* Determine which point is the highest (worst), */
        ihi=2;    /* next-highest, and lowest (best). */
        /* Whenever we "look at" a vertex, it gets a random thermal fluctuation. */
        ynhi=ylo=y[1]+tt*log(ran1(&idum));
        yhi=y[2]+tt*log(ran1(&idum));
        if (ylo > yhi) {
            ihi=1;
            ilo=2;
            ynhi=yhi;
            yhi=ylo;
            ylo=ynhi;
        }
        for (i=3;i<=mpts;i++) {    /* Loop over the points in the simplex. */
            yt=y[i]+tt*log(ran1(&idum));    /* More thermal fluctuations. */
            if (yt <= ylo) {
                ilo=i;
                ylo=yt;
            }
            if (yt > yhi) {
                ynhi=yhi;
                ihi=i;
                yhi=yt;
            } else if (yt > ynhi) {
                ynhi=yt;
            }
        }
        rtol=2.0*fabs(yhi-ylo)/(fabs(yhi)+fabs(ylo));
        /* Compute the fractional range from highest to lowest and return if satisfactory. */
        if (rtol < ftol || *iter < 0) {    /* If returning, put best point and value in slot 1. */
            swap=y[1];
            y[1]=y[ilo];
            y[ilo]=swap;
            for (n=1;n<=ndim;n++) {
                swap=p[1][n];
                p[1][n]=p[ilo][n];
                p[ilo][n]=swap;
            }
            break;
        }
        *iter -= 2;
        /* Begin a new iteration. First extrapolate by a factor -1 through the face of the
           simplex across from the high point, i.e., reflect the simplex from the high point. */
        ytry=amotsa(p,y,psum,ndim,pb,yb,funk,ihi,&yhi,-1.0);
        if (ytry <= ylo) {
            /* Gives a result better than the best point, so try an additional
               extrapolation by a factor of 2. */
            ytry=amotsa(p,y,psum,ndim,pb,yb,funk,ihi,&yhi,2.0);
        } else if (ytry >= ynhi) {
            /* The reflected point is worse than the second-highest, so look for an
               intermediate lower point, i.e., do a one-dimensional contraction. */
            ysave=yhi;
            ytry=amotsa(p,y,psum,ndim,pb,yb,funk,ihi,&yhi,0.5);
            if (ytry >= ysave) {    /* Can't seem to get rid of that high point. */
                for (i=1;i<=mpts;i++) {    /* Better contract around the lowest (best) point. */
                    if (i != ilo) {
                        for (j=1;j<=ndim;j++) {
                            psum[j]=0.5*(p[i][j]+p[ilo][j]);
                            p[i][j]=psum[j];
                        }
                        y[i]=(*funk)(psum);
                    }
                }
                *iter -= ndim;
                GET_PSUM    /* Recompute psum. */
            }
        } else ++(*iter);    /* Correct the evaluation count. */
    }
    free_vector(psum,1,ndim);
}


#include <math.h>
#include "nrutil.h"

extern long idum;    /* Defined and initialized in main. */
extern float tt;     /* Defined in amebsa. */

float amotsa(float **p, float y[], float psum[], int ndim, float pb[],
    float *yb, float (*funk)(float []), int ihi, float *yhi, float fac)
/* Extrapolates by a factor fac through the face of the simplex across from the high point, tries
it, and replaces the high point if the new point is better. */
{
    float ran1(long *idum);
    int j;
    float fac1,fac2,yflu,ytry,*ptry;

    ptry=vector(1,ndim);
    fac1=(1.0-fac)/ndim;
    fac2=fac1-fac;
    for (j=1;j<=ndim;j++)
        ptry[j]=psum[j]*fac1-p[ihi][j]*fac2;
    ytry=(*funk)(ptry);
    if (ytry <= *yb) {    /* Save the best-ever. */
        for (j=1;j<=ndim;j++) pb[j]=ptry[j];
        *yb=ytry;
    }
    /* We added a thermal fluctuation to all the current vertices, but we subtract it
       here, so as to give the simplex a thermal Brownian motion: It likes to accept
       any suggested change. */
    yflu=ytry-tt*log(ran1(&idum));
    if (yflu < *yhi) {
        y[ihi]=ytry;
        *yhi=yflu;
        for (j=1;j<=ndim;j++) {
            psum[j] += ptry[j]-p[ihi][j];
            p[ihi][j]=ptry[j];
        }
    }
    free_vector(ptry,1,ndim);
    return yflu;
}



There is not yet enough practical experience with the method of simulated 
annealing to say definitively what its future place among optimization methods
will be. The method has several extremely attractive features, rather unique when
compared with other optimization techniques. 

First, it is not “greedy,” in the sense that it is not easily fooled by the quick 
payoff achieved by falling into unfavorable local minima. Provided that sufficiently 
general reconfigurations are given, it wanders freely among local minima of depth 
less than about T. As T is lowered, the number of such minima qualifying for 
frequent visits is gradually reduced. 

Second, configuration decisions tend to proceed in a logical order. Changes 
that cause the greatest energy differences are sifted over when the control parameter 
T is large. These decisions become more permanent as T is lowered, and attention 
then shifts more to smaller refinements in the solution. For example, in the traveling 
salesman problem with the Mississippi River twist, if λ is large, a decision to cross
the Mississippi only twice is made at high T, while the specific routes on each side 
of the river are determined only at later stages. 

The analogies to thermodynamics may be pursued to a greater extent than we 
have done here. Quantities analogous to specific heat and entropy may be defined, 
and these can be useful in monitoring the progress of the algorithm towards an 
acceptable solution. Information on this subject is found in [1].

An N × N matrix A is said to have an eigenvector x and corresponding
eigenvalue λ if


$\mathbf{A} \cdot \mathbf{x} = \lambda \mathbf{x}$ (11.0.1)

Obviously any multiple of an eigenvector x will also be an eigenvector, but we 
won’t consider such multiples as being distinct eigenvectors. (The zero vector is not 
considered to be an eigenvector at all.) Evidently (11.0.1) can hold only if 

$\det|\mathbf{A} - \lambda \mathbf{1}| = 0$ (11.0.2)

which, if expanded out, is an Nth degree polynomial in λ whose roots are the eigenvalues.
This proves that there are always N (not necessarily distinct) eigenvalues.
Equal eigenvalues coming from multiple roots are called degenerate. Root-searching 
in the characteristic equation (11.0.2) is usually a very poor computational method 
for finding eigenvalues. We will learn much better ways in this chapter, as well as 
efficient ways for finding corresponding eigenvectors. 

The above two equations also prove that every one of the N eigenvalues has
a (not necessarily distinct) corresponding eigenvector: If λ is set to an eigenvalue,
then the matrix A − λ1 is singular, and we know that every singular matrix has at
least one nonzero vector in its nullspace (see §2.6 on singular value decomposition).

If you add τx to both sides of (11.0.1), you will easily see that the eigenvalues
of any matrix can be changed or shifted by an additive constant τ by adding to the
matrix that constant times the identity matrix. The eigenvectors are unchanged by
this shift. Shifting, as we will see, is an important part of many algorithms for 
computing eigenvalues. We see also that there is no special significance to a zero 
eigenvalue. Any eigenvalue can be shifted to zero, or any zero eigenvalue can be 
shifted away from zero. 



Definitions and Basic Facts

A matrix is called symmetric if it is equal to its transpose,

$\mathbf{A} = \mathbf{A}^T$ or $a_{ij} = a_{ji}$ (11.0.3)

It is called Hermitian or self-adjoint if it equals the complex-conjugate of its transpose
(its Hermitian conjugate, denoted by “†”)

$\mathbf{A} = \mathbf{A}^\dagger$ or $a_{ij} = a_{ji}^*$ (11.0.4)

It is termed orthogonal if its transpose equals its inverse,

$\mathbf{A}^T \cdot \mathbf{A} = \mathbf{A} \cdot \mathbf{A}^T = \mathbf{1}$ (11.0.5)

and unitary if its Hermitian conjugate equals its inverse. Finally, a matrix is called
normal if it commutes with its Hermitian conjugate,

$\mathbf{A} \cdot \mathbf{A}^\dagger = \mathbf{A}^\dagger \cdot \mathbf{A}$ (11.0.6)

For real matrices, Hermitian means the same as symmetric, unitary means the 
same as orthogonal, and both of these distinct classes are normal. 

The reason that “Hermitian” is an important concept has to do with eigenvalues. 
The eigenvalues of a Hermitian matrix are all real. In particular, the eigenvalues 
of a real symmetric matrix are all real. Contrariwise, the eigenvalues of a real 
nonsymmetric matrix may include real values, but may also include pairs of complex 
conjugate values; and the eigenvalues of a complex matrix that is not Hermitian 
will in general be complex. 

The reason that “normal” is an important concept has to do with the eigenvectors.
The eigenvectors of a normal matrix with nondegenerate (i.e., distinct)
eigenvalues are complete and orthogonal, spanning the N-dimensional vector space.
For a normal matrix with degenerate eigenvalues, we have the additional freedom of
replacing the eigenvectors corresponding to a degenerate eigenvalue by linear combinations
of themselves. Using this freedom, we can always perform Gram-Schmidt
orthogonalization (consult any linear algebra text) and find a set of eigenvectors that 
are complete and orthogonal, just as in the nondegenerate case. The matrix whose 
columns are an orthonormal set of eigenvectors is evidently unitary. A special case 
is that the matrix of eigenvectors of a real, symmetric matrix is orthogonal, since 
the eigenvectors of that matrix are all real. 

When a matrix is not normal, as typified by any random, nonsymmetric, real 
matrix, then in general we cannot find any orthonormal set of eigenvectors, nor even 
any pairs of eigenvectors that are orthogonal (except perhaps by rare chance). While 
the N non-orthonormal eigenvectors will “usually” span the N-dimensional vector
space, they do not always do so; that is, the eigenvectors are not always complete.
Such a matrix is said to be defective.

Left and Right Eigenvectors


While the eigenvectors of a non-normal matrix are not particularly orthogonal 
among themselves, they do have an orthogonality relation with a different set of 
vectors, which we must now define. Up to now our eigenvectors have been column 
vectors that are multiplied to the right of a matrix A, as in (11.0.1). These, more 
explicitly, are termed right eigenvectors. We could also, however, try to find row 
vectors, which multiply A to the left and satisfy 

$\mathbf{x} \cdot \mathbf{A} = \lambda \mathbf{x}$ (11.0.7)

These are called left eigenvectors. By taking the transpose of equation (11.0.7), we 
see that every left eigenvector is the transpose of a right eigenvector of the transpose 
of A. Now by comparing to (11.0.2), and using the fact that the determinant of a 
matrix equals the determinant of its transpose, we also see that the left and right 
eigenvalues of A are identical. 

If the matrix A is symmetric, then the left and right eigenvectors are just 
transposes of each other, that is, have the same numerical values as components. 
Likewise, if the matrix is self-adjoint, the left and right eigenvectors are Hermitian 
conjugates of each other. For the general nonnormal case, however, we have the 
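following calculation: Let X_R be the matrix formed by columns from the right
eigenvectors, and X_L be the matrix formed by rows from the left eigenvectors. Then
(11.0.1) and (11.0.7) can be rewritten as

$\mathbf{A} \cdot \mathbf{X}_R = \mathbf{X}_R \cdot \mathrm{diag}(\lambda_1 \ldots \lambda_N) \qquad \mathbf{X}_L \cdot \mathbf{A} = \mathrm{diag}(\lambda_1 \ldots \lambda_N) \cdot \mathbf{X}_L$ (11.0.8)

Multiplying the first of these equations on the left by X_L, the second on the right
by X_R, and subtracting the two, gives

$(\mathbf{X}_L \cdot \mathbf{X}_R) \cdot \mathrm{diag}(\lambda_1 \ldots \lambda_N) = \mathrm{diag}(\lambda_1 \ldots \lambda_N) \cdot (\mathbf{X}_L \cdot \mathbf{X}_R)$ (11.0.9)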

This says that the matrix of dot products of the left and right eigenvectors commutes 
with the diagonal matrix of eigenvalues. But the only matrices that commute with a 
diagonal matrix of distinct elements are themselves diagonal. Thus, if the eigenvalues 
are nondegenerate, each left eigenvector is orthogonal to all right eigenvectors except 
its corresponding one, and vice versa. By choice of normalization, the dot products 
of corresponding left and right eigenvectors can always be made unity for any matrix 
with nondegenerate eigenvalues. 
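
A hand-worked case makes this orthogonality concrete. The sketch below is purely illustrative: the 2 × 2 matrix, its eigenvectors (found by hand), and the function names are our own assumptions, chosen so that the eigenvalues 2 and 3 are distinct and the matrix is nonsymmetric. It checks that the stored vectors really are right and left eigenvectors, and exposes the dot products of left with right ones.

```c
#include <assert.h>
#include <math.h>

#define NDIM 2

/* Illustrative nonsymmetric matrix with distinct eigenvalues 2 and 3. */
static const double A[NDIM][NDIM]  = { {2.0, 1.0}, {0.0, 3.0} };
static const double LAM[NDIM]      = { 2.0, 3.0 };
/* Columns of XR are right eigenvectors; rows of XL are left eigenvectors. */
static const double XR[NDIM][NDIM] = { {1.0, 1.0}, {0.0, 1.0} };
static const double XL[NDIM][NDIM] = { {1.0, -1.0}, {0.0, 1.0} };

/* Largest component of A.x_k - lam_k x_k over the columns x_k of XR. */
double right_residual(void)
{
    int i, j, k;
    double worst = 0.0;

    for (k = 0; k < NDIM; k++)
        for (i = 0; i < NDIM; i++) {
            double r = -LAM[k] * XR[i][k];
            for (j = 0; j < NDIM; j++) r += A[i][j] * XR[j][k];
            if (fabs(r) > worst) worst = fabs(r);
        }
    return worst;
}

/* Largest component of x_k.A - lam_k x_k over the rows x_k of XL. */
double left_residual(void)
{
    int i, j, k;
    double worst = 0.0;

    for (k = 0; k < NDIM; k++)
        for (j = 0; j < NDIM; j++) {
            double r = -LAM[k] * XL[k][j];
            for (i = 0; i < NDIM; i++) r += XL[k][i] * A[i][j];
            if (fabs(r) > worst) worst = fabs(r);
        }
    return worst;
}

/* Dot product of left eigenvector i with right eigenvector j. */
double left_right_dot(int i, int j)
{
    int k;
    double dot = 0.0;

    assert(i >= 0 && i < NDIM && j >= 0 && j < NDIM);
    for (k = 0; k < NDIM; k++) dot += XL[i][k] * XR[k][j];
    return dot;
}
```

With this normalization the dot products form the identity matrix, even though the two right eigenvectors are not orthogonal to each other.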

If some eigenvalues are degenerate, then either the left or the right eigenvectors
corresponding to a degenerate eigenvalue must be linearly combined among
themselves to achieve orthogonality with the right or left ones, respectively. This 
can always be done by a procedure akin to Gram-Schmidt orthogonalization. The 
normalization can then be adjusted to give unity for the nonzero dot products between 
corresponding left and right eigenvectors. If the dot product of corresponding left and 
right eigenvectors is zero at this stage, then you have a case where the eigenvectors 
are incomplete! Note that incomplete eigenvectors can occur only where there are 
degenerate eigenvalues, but do not always occur in such cases (in fact, never occur 
for the class of “normal” matrices). See [1] for a clear discussion.



In both the degenerate and nondegenerate cases, the final normalization to
unity of all nonzero dot products produces the result: The matrix whose rows 
are left eigenvectors is the inverse matrix of the matrix whose columns are right 
eigenvectors, if the inverse exists. 


Diagonalization of a Matrix 


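Multiplying the first equation in (11.0.8) by X_L, and using the fact that X_L
and X_R are matrix inverses, we get

$\mathbf{X}_R^{-1} \cdot \mathbf{A} \cdot \mathbf{X}_R = \mathrm{diag}(\lambda_1 \ldots \lambda_N)$ (11.0.10)

This is a particular case of a similarity transform of the matrix A,

$\mathbf{A} \rightarrow \mathbf{Z}^{-1} \cdot \mathbf{A} \cdot \mathbf{Z}$ (11.0.11)

for some transformation matrix Z. Similarity transformations play a crucial role
in the computation of eigenvalues, because they leave the eigenvalues of a matrix
unchanged. This is easily seen from

$\det|\mathbf{Z}^{-1} \cdot \mathbf{A} \cdot \mathbf{Z} - \lambda\mathbf{1}| = \det|\mathbf{Z}^{-1} \cdot (\mathbf{A} - \lambda\mathbf{1}) \cdot \mathbf{Z}|$

$= \det|\mathbf{Z}| \, \det|\mathbf{A} - \lambda\mathbf{1}| \, \det|\mathbf{Z}^{-1}|$ (11.0.12)

$= \det|\mathbf{A} - \lambda\mathbf{1}|$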

Equation (11.0.10) shows that any matrix with complete eigenvectors (which includes 
all normal matrices and “most” random nonnormal ones) can be diagonalized by a 
similarity transformation, that the columns of the transformation matrix that effects 
the diagonalization are the right eigenvectors, and that the rows of its inverse are 
the left eigenvectors. 

For real, symmetric matrices, the eigenvectors are real and orthonormal, so the 
transformation matrix is orthogonal. The similarity transformation is then also an 
orthogonal transformation of the form 

$\mathbf{A} \rightarrow \mathbf{Z}^T \cdot \mathbf{A} \cdot \mathbf{Z}$ (11.0.13)

While real nonsymmetric matrices can be diagonalized in their usual case of complete 
eigenvectors, the transformation matrix is not necessarily real. It turns out, however, 
that a real similarity transformation can “almost” do the job. It can reduce the matrix 
down to a form with little two-by-two blocks along the diagonal, all other elements 
zero. Each two-by-two block corresponds to a complex-conjugate pair of complex 
eigenvalues. We will see this idea exploited in some routines given later in the chapter. 

The “grand strategy” of virtually all modern eigensystem routines is to nudge 
the matrix A towards diagonal form by a sequence of similarity transformations, 

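$\mathbf{A} \rightarrow \mathbf{P}_1^{-1} \cdot \mathbf{A} \cdot \mathbf{P}_1 \rightarrow \mathbf{P}_2^{-1} \cdot \mathbf{P}_1^{-1} \cdot \mathbf{A} \cdot \mathbf{P}_1 \cdot \mathbf{P}_2$

$\rightarrow \mathbf{P}_3^{-1} \cdot \mathbf{P}_2^{-1} \cdot \mathbf{P}_1^{-1} \cdot \mathbf{A} \cdot \mathbf{P}_1 \cdot \mathbf{P}_2 \cdot \mathbf{P}_3 \rightarrow \text{etc.}$ (11.0.14)

If we get all the way to diagonal form, then the eigenvectors are the columns of
the accumulated transformation

$\mathbf{X}_R = \mathbf{P}_1 \cdot \mathbf{P}_2 \cdot \mathbf{P}_3 \cdots$ (11.0.15)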

Sometimes we do not want to go all the way to diagonal form. For example, if we are 
interested only in eigenvalues, not eigenvectors, it is enough to transform the matrix 
A to be triangular, with all elements below (or above) the diagonal zero. In this 
case the diagonal elements are already the eigenvalues, as you can see by mentally 
evaluating (11.0.2) using expansion by minors. 

There are two rather different sets of techniques for implementing the grand 
strategy (11.0.14). It turns out that they work rather well in combination, so most 
modern eigensystem routines use both. The first set of techniques constructs individual
P_i's as explicit “atomic” transformations designed to perform specific tasks, for
example zeroing a particular off-diagonal element (Jacobi transformation, §11.1), or
a whole particular row or column (Householder transformation, §11.2; elimination
method, §11.5). In general, a finite sequence of these simple transformations cannot
completely diagonalize a matrix. There are then two choices: either use the finite
sequence of transformations to go most of the way (e.g., to some special form like
tridiagonal or Hessenberg, see §11.2 and §11.5 below) and follow up with the second
set of techniques about to be mentioned; or else iterate the finite sequence of simple 
transformations over and over until the deviation of the matrix from diagonal is 
negligibly small. This latter approach is conceptually simplest, so we will discuss 
it in the next section; however, for N greater than ~ 10, it is computationally 
inefficient by a roughly constant factor ~ 5. 

The second set of techniques, called factorization methods, is more subtle. 
Suppose that the matrix A can be factored into a left factor F_L and a right factor
F_R. Then

$\mathbf{A} = \mathbf{F}_L \cdot \mathbf{F}_R$ or equivalently $\mathbf{F}_L^{-1} \cdot \mathbf{A} = \mathbf{F}_R$ (11.0.16)

If we now multiply back together the factors in the reverse order, and use the second
equation in (11.0.16) we get

$\mathbf{F}_R \cdot \mathbf{F}_L = \mathbf{F}_L^{-1} \cdot \mathbf{A} \cdot \mathbf{F}_L$ (11.0.17)

which we recognize as having effected a similarity transformation on A with the
transformation matrix being F_L! In §11.3 and §11.6 we will discuss the QR method
which exploits this idea. 
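
The reverse-order trick can be tried by hand on a 2 × 2 matrix. Note the assumption in the sketch below: it uses an LU factorization without pivoting, only because that is the simplest factorization to write out explicitly. It is not the QR factorization of §11.3, and the names lu_step2, trace2, and det2 are hypothetical. Since a similarity transformation preserves the eigenvalues, the trace and determinant must come out unchanged.

```c
#include <assert.h>
#include <math.h>

/* One factorization step on a 2x2 matrix: factor a = FL.FR with FL unit
   lower triangular and FR upper triangular (LU without pivoting, so
   a[0][0] must be nonzero), then overwrite a with the product FR.FL. */
void lu_step2(double a[2][2])
{
    double l10, u00, u01, u11;

    assert(a[0][0] != 0.0);
    l10 = a[1][0] / a[0][0];          /* FL = [[1, 0], [l10, 1]]      */
    u00 = a[0][0];                    /* FR = [[u00, u01], [0, u11]]  */
    u01 = a[0][1];
    u11 = a[1][1] - l10 * a[0][1];

    a[0][0] = u00 + u01 * l10;        /* FR.FL, multiplied out */
    a[0][1] = u01;
    a[1][0] = u11 * l10;
    a[1][1] = u11;
}

/* Sum of the eigenvalues. */
double trace2(double a[2][2])
{
    return a[0][0] + a[1][1];
}

/* Product of the eigenvalues. */
double det2(double a[2][2])
{
    return a[0][0] * a[1][1] - a[0][1] * a[1][0];
}
```

Applied to the matrix with rows (4, 1) and (2, 3), one step leaves the trace at 7 and the determinant at 10 while shrinking the subdiagonal element from 2 to 1.25.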

Factorization methods also do not converge exactly in a finite number of 
transformations. But the better ones do converge rapidly and reliably, and, when 
following an appropriate initial reduction by simple similarity transformations, they 
are the methods of choice.

“Eigenpackages of Canned Eigenroutines”

You have probably gathered by now that the solution of eigensystems is a fairly 
complicated business. It is. It is one of the few subjects covered in this book for 
which we do not recommend that you avoid canned routines. On the contrary, the 
purpose of this chapter is precisely to give you some appreciation of what is going 
on inside such canned routines, so that you can make intelligent choices about using 
them, and intelligent diagnoses when something goes wrong. 

You will find that almost all canned routines in use nowadays trace their ancestry 
back to routines published in Wilkinson and Reinsch’s Handbook for Automatic 
Computation, Vol. II, Linear Algebra [2]. This excellent reference, containing papers
by a number of authors, is the Bible of the field. A public-domain implementation
of the Handbook routines in FORTRAN is the EISPACK set of programs [3]. The
routines in this chapter are translations of either the Handbook or EISPACK routines, 
so understanding these will take you a lot of the way towards understanding those 
canonical packages. 

IMSL [4] and NAG [5] each provide proprietary implementations, in FORTRAN, 
of what are essentially the Handbook routines. 

A good “eigenpackage” will provide separate routines, or separate paths through 
sequences of routines, for the following desired calculations: 

• all eigenvalues and no eigenvectors 

• all eigenvalues and some corresponding eigenvectors 

• all eigenvalues and all corresponding eigenvectors 

The purpose of these distinctions is to save compute time and storage; it is wasteful 
to calculate eigenvectors that you don’t need. Often one is interested only in 
the eigenvectors corresponding to the largest few eigenvalues, or largest few in 
magnitude, or few that are negative. The method usually used to calculate “some”
eigenvectors is more efficient than calculating all eigenvectors if you desire
fewer than about a quarter of the eigenvectors.

A good eigenpackage also provides separate paths for each of the above 
calculations for each of the following special forms of the matrix: 

• real, symmetric, tridiagonal 

• real, symmetric, banded (only a small number of sub- and superdiagonals 
are nonzero) 

• real, symmetric 

• real, nonsymmetric 

• complex, Hermitian 

• complex, non-Hermitian 

Again, the purpose of these distinctions is to save time and storage by using the least 
general routine that will serve in any particular application. 

In this chapter, as a bare introduction, we give good routines for the following 
paths: 

• all eigenvalues and eigenvectors of a real, symmetric, tridiagonal matrix 
(§11.3) 

• all eigenvalues and eigenvectors of a real, symmetric matrix (§11.1–§11.3)

• all eigenvalues and eigenvectors of a complex, Hermitian matrix 
(§11.4) 

• all eigenvalues and no eigenvectors of a real, nonsymmetric matrix (§11.5–§11.6)

We also discuss, in §11.7, how to obtain some eigenvectors of nonsymmetric 
matrices by the method of inverse iteration. 

Generalized and Nonlinear Eigenvalue Problems 

Many eigenpackages also deal with the so-called generalized eigenproblem, [6] 

$\mathbf{A} \cdot \mathbf{x} = \lambda \mathbf{B} \cdot \mathbf{x}$ (11.0.18)

where A and B are both matrices. Most such problems, where B is nonsingular,
can be handled by the equivalent

$(\mathbf{B}^{-1} \cdot \mathbf{A}) \cdot \mathbf{x} = \lambda \mathbf{x}$ (11.0.19)

Often A and B are symmetric and B is positive definite. The matrix B^{-1} · A in
(11.0.19) is not symmetric, but we can recover a symmetric eigenvalue problem
by using the Cholesky decomposition B = L · L^T of §2.9. Multiplying equation
(11.0.18) by L^{-1}, we get

$\mathbf{C} \cdot (\mathbf{L}^T \cdot \mathbf{x}) = \lambda (\mathbf{L}^T \cdot \mathbf{x})$ (11.0.20)

where

$\mathbf{C} = \mathbf{L}^{-1} \cdot \mathbf{A} \cdot (\mathbf{L}^{-1})^T$ (11.0.21)

The matrix C is symmetric and its eigenvalues are the same as those of the original
problem (11.0.18); its eigenvectors are L^T · x. The efficient way to form C is
first to solve the equation

$\mathbf{Y} \cdot \mathbf{L}^T = \mathbf{A}$ (11.0.22)

for the lower triangle of the matrix Y. Then solve

$\mathbf{L} \cdot \mathbf{C} = \mathbf{Y}$ (11.0.23)

for the lower triangle of the symmetric matrix C. 

Another generalization of the standard eigenvalue problem is to problems 
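nonlinear in the eigenvalue λ, for example,

$(\mathbf{A}\lambda^2 + \mathbf{B}\lambda + \mathbf{C}) \cdot \mathbf{x} = 0$ (11.0.24)

This can be turned into a linear problem by introducing an additional unknown
eigenvector y and solving the 2N × 2N eigensystem,

$\begin{pmatrix} 0 & \mathbf{1} \\ -\mathbf{A}^{-1} \cdot \mathbf{C} & -\mathbf{A}^{-1} \cdot \mathbf{B} \end{pmatrix} \cdot \begin{pmatrix} \mathbf{x} \\ \mathbf{y} \end{pmatrix} = \lambda \begin{pmatrix} \mathbf{x} \\ \mathbf{y} \end{pmatrix}$ (11.0.25)

This technique generalizes to higher-order polynomials in λ. A polynomial of degree
M produces a linear MN × MN eigensystem (see [7]).

11.1 Jacobi Transformations of a Symmetric Matrix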


The Jacobi method consists of a sequence of orthogonal similarity transformations
of the form of equation (11.0.14). Each transformation (a Jacobi rotation) is
just a plane rotation designed to annihilate one of the off-diagonal matrix elements.
Successive transformations undo previously set zeros, but the off-diagonal elements
nevertheless get smaller and smaller, until the matrix is diagonal to machine precision.
Accumulating the product of the transformations as you go gives the matrix
of eigenvectors, equation (11.0.15), while the elements of the final diagonal matrix 
are the eigenvalues. 

The Jacobi method is absolutely foolproof for all real symmetric matrices. For 
matrices of order greater than about 10, say, the algorithm is slower, by a significant
constant factor, than the QR method we shall give in §11.3. However, the Jacobi 
algorithm is much simpler than the more efficient methods. We thus recommend it 
for matrices of moderate order, where expense is not a major consideration. 

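The basic Jacobi rotation P_pq is a matrix of the form

$\mathbf{P}_{pq} = \begin{pmatrix} 1 & & & & & & \\ & \ddots & & & & & \\ & & c & \cdots & s & & \\ & & \vdots & 1 & \vdots & & \\ & & -s & \cdots & c & & \\ & & & & & \ddots & \\ & & & & & & 1 \end{pmatrix}$ (11.1.1)

Here all the diagonal elements are unity except for the two elements c in rows (and
columns) p and q. All off-diagonal elements are zero except the two elements s and
−s. The numbers c and s are the cosine and sine of a rotation angle φ, so c² + s² = 1.

A plane rotation such as (11.1.1) is used to transform the matrix A according to

$\mathbf{A}' = \mathbf{P}_{pq}^T \cdot \mathbf{A} \cdot \mathbf{P}_{pq}$ (11.1.2)

Now, P_pq^T · A changes only rows p and q of A, while A · P_pq changes only columns
p and q. Notice that the subscripts p and q do not denote components of P_pq, but
rather label which kind of rotation the matrix is, i.e., which rows and columns it
affects. Thus the changed elements of A in (11.1.2) are only in the p and q rows
and columns indicated below:

$\mathbf{A}' = \begin{pmatrix} & \cdots & a'_{1p} & \cdots & a'_{1q} & \cdots & \\ \vdots & & \vdots & & \vdots & & \vdots \\ a'_{p1} & \cdots & a'_{pp} & \cdots & a'_{pq} & \cdots & a'_{pn} \\ \vdots & & \vdots & & \vdots & & \vdots \\ a'_{q1} & \cdots & a'_{qp} & \cdots & a'_{qq} & \cdots & a'_{qn} \\ \vdots & & \vdots & & \vdots & & \vdots \\ & \cdots & a'_{np} & \cdots & a'_{nq} & \cdots & \end{pmatrix}$ (11.1.3)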

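Multiplying out equation (11.1.2) and using the symmetry of A, we get the explicit
formulas

$a'_{rp} = c\,a_{rp} - s\,a_{rq}$
$a'_{rq} = c\,a_{rq} + s\,a_{rp} \qquad (r \neq p,\ r \neq q)$ (11.1.4)

$a'_{pp} = c^2 a_{pp} + s^2 a_{qq} - 2sc\,a_{pq}$ (11.1.5)

$a'_{qq} = s^2 a_{pp} + c^2 a_{qq} + 2sc\,a_{pq}$ (11.1.6)

$a'_{pq} = (c^2 - s^2)\,a_{pq} + sc\,(a_{pp} - a_{qq})$ (11.1.7)

The idea of the Jacobi method is to zero the off-diagonal elements by a series of
plane rotations. Setting a'_pq = 0 in equation (11.1.7) fixes the rotation angle through

$\theta \equiv \cot 2\phi \equiv \frac{c^2 - s^2}{2sc} = \frac{a_{qq} - a_{pp}}{2a_{pq}}$ (11.1.8)

If we let t ≡ s/c, the definition of θ can be rewritten

$t^2 + 2t\theta - 1 = 0$ (11.1.9)

The smaller root of this equation corresponds to a rotation angle less than π/4
in magnitude; this choice at each stage gives the most stable reduction. Using the
form of the quadratic formula with the discriminant in the denominator, we can
write this smaller root as

$t = \frac{\mathrm{sgn}(\theta)}{|\theta| + \sqrt{\theta^2 + 1}}$ (11.1.10)

If θ is so large that θ² would overflow on the computer, we set t = 1/(2θ). It
now follows that

$c = \frac{1}{\sqrt{t^2 + 1}}$ (11.1.11)

$s = tc$ (11.1.12)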


When we actually use equations (11.1.4)–(11.1.7) numerically, we rewrite them
to minimize roundoff error. Equation (11.1.7) is replaced by

$a'_{pq} = 0$ (11.1.13)

The idea in the remaining equations is to set the new quantity equal to the old
quantity plus a small correction. Thus we can use (11.1.7) and (11.1.13) to eliminate
a_qq from (11.1.5), giving

$a'_{pp} = a_{pp} - t\,a_{pq}$ (11.1.14)

Similarly,

$a'_{qq} = a_{qq} + t\,a_{pq}$ (11.1.15)

$a'_{rp} = a_{rp} - s\,(a_{rq} + \tau a_{rp})$ (11.1.16)

$a'_{rq} = a_{rq} + s\,(a_{rp} - \tau a_{rq})$ (11.1.17)

where τ (= tan φ/2) is defined by

$\tau \equiv \frac{s}{1 + c}$ (11.1.18)


One can see the convergence of the Jacobi method by considering the sum of 
the squares of the off-diagonal elements 


$S = \sum_{r \neq s} |a_{rs}|^2$ (11.1.19)

Equations (11.1.4)–(11.1.7) imply that

$S' = S - 2|a_{pq}|^2$ (11.1.20)

(Since the transformation is orthogonal, the sum of the squares of the diagonal
elements increases correspondingly by 2|a_pq|².) The sequence of S’s thus decreases
monotonically. Since the sequence is bounded below by zero, and since we can
choose a_pq to be whatever element we want, the sequence can be made to converge
to zero. 



Eventually one obtains a matrix D that is diagonal to machine precision. The 
diagonal elements give the eigenvalues of the original matrix A, since 


$\mathbf{D} = \mathbf{V}^T \cdot \mathbf{A} \cdot \mathbf{V}$ (11.1.21)

where

$\mathbf{V} = \mathbf{P}_1 \cdot \mathbf{P}_2 \cdot \mathbf{P}_3 \cdots$ (11.1.22)

the P_i's being the successive Jacobi rotation matrices. The columns of V are the
eigenvectors (since A · V = V · D). They can be computed by applying

$\mathbf{V}' = \mathbf{V} \cdot \mathbf{P}_i$ (11.1.23)

at each stage of calculation, where initially V is the identity matrix. In detail,
equation (11.1.23) is

$v'_{rs} = v_{rs} \qquad (s \neq p,\ s \neq q)$
$v'_{rp} = c\,v_{rp} - s\,v_{rq}$ (11.1.24)
$v'_{rq} = s\,v_{rp} + c\,v_{rq}$

We rewrite these equations in terms of τ as in equations (11.1.16) and (11.1.17)
to minimize roundoff. 

The only remaining question is the strategy one should adopt for the order in 
which the elements are to be annihilated. Jacobi’s original algorithm of 1846 searched 
the whole upper triangle at each stage and set the largest off-diagonal element to zero. 
This is a reasonable strategy for hand calculation, but it is prohibitive on a computer 
since the search alone makes each Jacobi rotation a process of order N² instead of N.

A better strategy for our purposes is the cyclic Jacobi method, where one
annihilates elements in strict order. For example, one can simply proceed down
the rows: P_12, P_13, ..., P_1n; then P_23, P_24, etc. One such set of n(n − 1)/2
Jacobi rotations is called a sweep. One can show that convergence is generally
quadratic for both the original and the cyclic Jacobi methods, for nondegenerate
eigenvalues.

The program below, based on the implementations in [1,2], uses two further 
refinements: 

• In the first three sweeps, we carry out the pq rotation only if |a_pq| > ε
for some threshold value

$\epsilon = \frac{1}{5} \frac{S_0}{n^2}$ (11.1.25)

where S_0 is the sum of the off-diagonal moduli,

$S_0 = \sum_{r < s} |a_{rs}|$ (11.1.26)

• After four sweeps, if |a_pq| ≪ |a_pp| and |a_pq| ≪ |a_qq|, we set |a_pq| = 0
and skip the rotation. The criterion used in the comparison is |a_pq| <
10^{−(D+2)} |a_pp|, where D is the number of significant decimal digits on the
machine, and similarly for |a_qq|.

In the following routine the n × n symmetric matrix a is stored as a[1..n][1..n].
On output, the superdiagonal elements of a are destroyed, but the diagonal
and subdiagonal are unchanged and give full information on the original symmetric
matrix a. The vector d[1..n] returns the eigenvalues of a. During the computation,
it contains the current diagonal of a. The matrix v[1..n][1..n] outputs the
normalized eigenvector belonging to d[k] in its kth column. The parameter nrot is
the number of Jacobi rotations that were needed to achieve convergence.

Typical matrices require 6 to 10 sweeps to achieve convergence, or 3n² to 5n²
Jacobi rotations. Each rotation requires of order 4n operations, each consisting
of a multiply and an add, so the total labor is of order 12n³ to 20n³ operations.
Calculation of the eigenvectors as well as the eigenvalues changes the operation
count from 4n to 6n per rotation, which is only a 50 percent overhead.

#include <math.h>
#include "nrutil.h"
#define ROTATE(a,i,j,k,l) g=a[i][j];h=a[k][l];a[i][j]=g-s*(h+g*tau);\
    a[k][l]=h+s*(g-h*tau);

void jacobi(float **a, int n, float d[], float **v, int *nrot)
/* Computes all eigenvalues and eigenvectors of a real symmetric matrix a[1..n][1..n]. On
output, elements of a above the diagonal are destroyed. d[1..n] returns the eigenvalues of a.
v[1..n][1..n] is a matrix whose columns contain, on output, the normalized eigenvectors of
a. nrot returns the number of Jacobi rotations that were required. */
{
    int j,iq,ip,i;
    float tresh,theta,tau,t,sm,s,h,g,c,*b,*z;

    b=vector(1,n);
    z=vector(1,n);
    for (ip=1;ip<=n;ip++) {    /* Initialize to the identity matrix. */
        for (iq=1;iq<=n;iq++) v[ip][iq]=0.0;
        v[ip][ip]=1.0;
    }
    for (ip=1;ip<=n;ip++) {    /* Initialize b and d to the diagonal of a. */
        b[ip]=d[ip]=a[ip][ip];
        z[ip]=0.0;    /* This vector will accumulate terms of the form
                         t*a_pq as in equation (11.1.14). */
    }
    *nrot=0;
    for (i=1;i<=50;i++) {
        sm=0.0;
        for (ip=1;ip<=n-1;ip++) {    /* Sum off-diagonal elements. */
            for (iq=ip+1;iq<=n;iq++)
                sm += fabs(a[ip][iq]);
        }
        if (sm == 0.0) {    /* The normal return, which relies on quadratic
                               convergence to machine underflow. */
            free_vector(z,1,n);
            free_vector(b,1,n);
            return;
        }
        if (i < 4)
            tresh=0.2*sm/(n*n);    /* ...on the first three sweeps. */
        else
            tresh=0.0;    /* ...thereafter. */
        for (ip=1;ip<=n-1;ip++) {
            for (iq=ip+1;iq<=n;iq++) {
                g=100.0*fabs(a[ip][iq]);
                /* After four sweeps, skip the rotation if the off-diagonal element is small. */
                if (i > 4 && (float)(fabs(d[ip])+g) == (float)fabs(d[ip])
                    && (float)(fabs(d[iq])+g) == (float)fabs(d[iq]))
                    a[ip][iq]=0.0;
                else if (fabs(a[ip][iq]) > tresh) {
                    h=d[iq]-d[ip];
                    if ((float)(fabs(h)+g) == (float)fabs(h))
                        t=(a[ip][iq])/h;    /* t = 1/(2*theta) */
                    else {
                        theta=0.5*h/(a[ip][iq]);    /* Equation (11.1.10). */
                        t=1.0/(fabs(theta)+sqrt(1.0+theta*theta));
                        if (theta < 0.0) t = -t;
                    }
                    c=1.0/sqrt(1+t*t);
                    s=t*c;
                    tau=s/(1.0+c);
                    h=t*a[ip][iq];
                    z[ip] -= h;
                    z[iq] += h;
                    d[ip] -= h;
                    d[iq] += h;
                    a[ip][iq]=0.0;
                    for (j=1;j<=ip-1;j++) {    /* Case of rotations 1 <= j < p. */
                        ROTATE(a,j,ip,j,iq)
                    }
                    for (j=ip+1;j<=iq-1;j++) {    /* Case of rotations p < j < q. */
                        ROTATE(a,ip,j,j,iq)
                    }
                    for (j=iq+1;j<=n;j++) {    /* Case of rotations q < j <= n. */
                        ROTATE(a,ip,j,iq,j)
                    }
                    for (j=1;j<=n;j++) {
                        ROTATE(v,j,ip,j,iq)
                    }
                    ++(*nrot);
                }
            }
        }
        for (ip=1;ip<=n;ip++) {
            b[ip] += z[ip];
            d[ip]=b[ip];    /* Update d with the sum of t*a_pq, */
            z[ip]=0.0;      /* and reinitialize z. */
        }
    }
    nrerror("Too many iterations in routine jacobi");
}


Note that the above routine assumes that underflows are set to zero. On 
machines where this is not true, the program must be modified. 

The eigenvalues are not ordered on output. If sorting is desired, the following 
routine can be invoked to reorder the output of jacobi or of later routines in this 
chapter. (The method, straight insertion, is $N^2$ rather than $N \log N$; but since you
have just done an $N^3$ procedure to get the eigenvalues, you can afford yourself
this little indulgence.)

void eigsrt(float d[], float **v, int n)
Given the eigenvalues d[1..n] and eigenvectors v[1..n][1..n] as output from jacobi
(§11.1) or tqli (§11.3), this routine sorts the eigenvalues into descending order, and rearranges
the columns of v correspondingly. The method is straight insertion.
{
    int k,j,i;
    float p;

    for (i=1;i<n;i++) {
        p=d[k=i];
        for (j=i+1;j<=n;j++)
            if (d[j] >= p) p=d[k=j];
        if (k != i) {
            d[k]=d[i];
            d[i]=p;
            for (j=1;j<=n;j++) {
                p=v[j][i];
                v[j][i]=v[j][k];
                v[j][k]=p;
            }
        }
    }
}


CITED REFERENCES AND FURTHER READING: 

Golub, G.H., and Van Loan, C.F. 1989, Matrix Computations, 2nd ed. (Baltimore: Johns Hopkins 
University Press), §8.4. 

Smith, B.T., et al. 1976, Matrix Eigensystem Routines — EISPACK Guide, 2nd ed., vol. 6 of 
Lecture Notes in Computer Science (New York: Springer-Verlag). [1] 

Wilkinson, J.H., and Reinsch, C. 1971, Linear Algebra, vol. II of Handbook for Automatic
Computation (New York: Springer-Verlag). [2]



11.2 Reduction of a Symmetric Matrix 
to Tridiagonal Form: Givens and 
Householder Reductions 

As already mentioned, the optimum strategy for finding eigenvalues and 
eigenvectors is, first, to reduce the matrix to a simple form, only then beginning an 
iterative procedure. For symmetric matrices, the preferred simple form is tridiagonal. 
The Givens reduction is a modification of the Jacobi method. Instead of trying to 
reduce the matrix all the way to diagonal form, we are content to stop when the 
matrix is tridiagonal. This allows the procedure to be carried out in a finite number 
of steps, unlike the Jacobi method, which requires iteration to convergence. 

Givens Method 

For the Givens method, we choose the rotation angle in equation (11.1.1) so
as to zero an element that is not at one of the four “corners,” i.e., not $a_{pp}$, $a_{pq}$,
or $a_{qq}$ in equation (11.1.3). Specifically, we first choose $\mathbf{P}_{23}$ to annihilate $a_{31}$
(and, by symmetry, $a_{13}$). Then we choose $\mathbf{P}_{24}$ to annihilate $a_{41}$. In general, we
choose the sequence

$$\mathbf{P}_{23}, \mathbf{P}_{24}, \ldots, \mathbf{P}_{2n};\ \mathbf{P}_{34}, \ldots, \mathbf{P}_{3n};\ \ldots;\ \mathbf{P}_{n-1,n}$$

where $\mathbf{P}_{jk}$ annihilates $a_{k,j-1}$. The method works because elements such as $a'_{rp}$ and
$a'_{rq}$, with $r \ne p$, $r \ne q$, are linear combinations of the old quantities $a_{rp}$ and $a_{rq}$, by
equation (11.1.4). Thus, if $a_{rp}$ and $a_{rq}$ have already been set to zero, they remain
zero as the reduction proceeds. Evidently, of order $n^2/2$ rotations are required,
and the number of multiplications in a straightforward implementation is of order
$4n^3/3$, not counting those for keeping track of the product of the transformation
matrices, required for the eigenvectors.

The Householder method, to be discussed next, is just as stable as the Givens 
reduction and it is a factor of 2 more efficient, so the Givens method is not generally 
used. Recent work (see [1]) has shown that the Givens reduction can be reformulated
to reduce the number of operations by a factor of 2, and also avoid the necessity 
of taking square roots. This appears to make the algorithm competitive with the 
Householder reduction. However, this “fast Givens” reduction has to be monitored 
to avoid overflows, and the variables have to be periodically rescaled. There does 
not seem to be any compelling reason to prefer the Givens reduction over the 
Householder method. 

Householder Method 

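The Householder algorithm reduces an $n \times n$ symmetric matrix $\mathbf{A}$ to tridiagonal
form by $n - 2$ orthogonal transformations. Each transformation annihilates the
required part of a whole column and whole corresponding row. The basic ingredient
is a Householder matrix $\mathbf{P}$, which has the form

$$\mathbf{P} = \mathbf{1} - 2\,\mathbf{w} \cdot \mathbf{w}^T \tag{11.2.1}$$

where $\mathbf{w}$ is a real vector with $|\mathbf{w}|^2 = 1$. (In the present notation, the outer or matrix
product of two vectors, $\mathbf{a}$ and $\mathbf{b}$, is written $\mathbf{a} \cdot \mathbf{b}^T$, while the inner or scalar product of
the vectors is written as $\mathbf{a}^T \cdot \mathbf{b}$.) The matrix $\mathbf{P}$ is orthogonal, because

$$\begin{aligned}
\mathbf{P}^2 &= (\mathbf{1} - 2\,\mathbf{w} \cdot \mathbf{w}^T) \cdot (\mathbf{1} - 2\,\mathbf{w} \cdot \mathbf{w}^T) \\
&= \mathbf{1} - 4\,\mathbf{w} \cdot \mathbf{w}^T + 4\,\mathbf{w} \cdot (\mathbf{w}^T \cdot \mathbf{w}) \cdot \mathbf{w}^T \\
&= \mathbf{1}
\end{aligned} \tag{11.2.2}$$

Therefore $\mathbf{P} = \mathbf{P}^{-1}$. But $\mathbf{P}^T = \mathbf{P}$, and so $\mathbf{P}^T = \mathbf{P}^{-1}$, proving orthogonality.
Rewrite $\mathbf{P}$ as

$$\mathbf{P} = \mathbf{1} - \frac{\mathbf{u} \cdot \mathbf{u}^T}{H} \tag{11.2.3}$$

where the scalar $H$ is

$$H \equiv \tfrac{1}{2}\,|\mathbf{u}|^2 \tag{11.2.4}$$

and $\mathbf{u}$ can now be any vector. Suppose $\mathbf{x}$ is the vector composed of the first column
of $\mathbf{A}$. Choose

$$\mathbf{u} = \mathbf{x} \mp |\mathbf{x}|\,\mathbf{e}_1 \tag{11.2.5}$$

where $\mathbf{e}_1$ is the unit vector $[1, 0, \ldots, 0]^T$, and the choice of signs will be made
later. Then

$$\begin{aligned}
\mathbf{P} \cdot \mathbf{x} &= \mathbf{x} - \frac{\mathbf{u}}{H} \cdot (\mathbf{x} \mp |\mathbf{x}|\,\mathbf{e}_1)^T \cdot \mathbf{x} \\
&= \mathbf{x} - \frac{2\,\mathbf{u} \cdot (|\mathbf{x}|^2 \mp |\mathbf{x}|\,x_1)}{2|\mathbf{x}|^2 \mp 2|\mathbf{x}|\,x_1} \\
&= \mathbf{x} - \mathbf{u} \\
&= \pm |\mathbf{x}|\,\mathbf{e}_1
\end{aligned} \tag{11.2.6}$$

This shows that the Householder matrix $\mathbf{P}$ acts on a given vector $\mathbf{x}$ to zero all its
elements except the first one.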


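To reduce a symmetric matrix $\mathbf{A}$ to tridiagonal form, we choose the vector $\mathbf{x}$
for the first Householder matrix to be the lower $n - 1$ elements of the first column.
Then the lower $n - 2$ elements will be zeroed:

$$\mathbf{P}_1 \cdot \mathbf{A} =
\begin{bmatrix}
1 & 0 & 0 & \cdots & 0 \\
0 & & & & \\
0 & & {}^{(n-1)}\mathbf{P}_1 & & \\
\vdots & & & & \\
0 & & & &
\end{bmatrix}
\cdot
\begin{bmatrix}
a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\
a_{21} & & & & \\
a_{31} & & \text{irrelevant} & & \\
\vdots & & & & \\
a_{n1} & & & &
\end{bmatrix}
=
\begin{bmatrix}
a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\
k & & & & \\
0 & & \text{irrelevant} & & \\
\vdots & & & & \\
0 & & & &
\end{bmatrix}
\tag{11.2.7}$$

Here we have written the matrices in partitioned form, with ${}^{(n-1)}\mathbf{P}$ denoting a
Householder matrix with dimensions $(n-1) \times (n-1)$. The quantity $k$ is simply
plus or minus the magnitude of the vector $[a_{21}, \ldots, a_{n1}]^T$.

The complete orthogonal transformation is now

$$\mathbf{A}' = \mathbf{P} \cdot \mathbf{A} \cdot \mathbf{P} =
\begin{bmatrix}
a_{11} & k & 0 & \cdots & 0 \\
k & & & & \\
0 & & \text{irrelevant} & & \\
\vdots & & & & \\
0 & & & &
\end{bmatrix}
\tag{11.2.8}$$

We have used the fact that $\mathbf{P}^T = \mathbf{P}$.

Now choose the vector $\mathbf{x}$ for the second Householder matrix to be the bottom
$n - 2$ elements of the second column, and from it construct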


$$\mathbf{P}_2 \equiv
\begin{bmatrix}
1 & 0 & 0 & \cdots & 0 \\
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & & & \\
\vdots & \vdots & & {}^{(n-2)}\mathbf{P}_2 & \\
0 & 0 & & &
\end{bmatrix}
\tag{11.2.9}$$

The identity block in the upper left corner ensures that the tridiagonalization achieved
in the first step will not be spoiled by this one, while the $(n-2)$-dimensional
Householder matrix ${}^{(n-2)}\mathbf{P}_2$ creates one additional row and column of the tridiagonal
output. Clearly, a sequence of $n - 2$ such transformations will reduce the matrix
$\mathbf{A}$ to tridiagonal form.

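Instead of actually carrying out the matrix multiplications in $\mathbf{P} \cdot \mathbf{A} \cdot \mathbf{P}$, we
compute a vector

$$\mathbf{p} \equiv \frac{\mathbf{A} \cdot \mathbf{u}}{H} \tag{11.2.10}$$

Then

$$\begin{aligned}
\mathbf{A} \cdot \mathbf{P} &= \mathbf{A} \cdot \left(\mathbf{1} - \frac{\mathbf{u} \cdot \mathbf{u}^T}{H}\right) = \mathbf{A} - \mathbf{p} \cdot \mathbf{u}^T \\
\mathbf{A}' = \mathbf{P} \cdot \mathbf{A} \cdot \mathbf{P} &= \mathbf{A} - \mathbf{p} \cdot \mathbf{u}^T - \mathbf{u} \cdot \mathbf{p}^T + 2K\,\mathbf{u} \cdot \mathbf{u}^T
\end{aligned}$$

where the scalar $K$ is defined by

$$K \equiv \frac{\mathbf{u}^T \cdot \mathbf{p}}{2H} \tag{11.2.11}$$

If we write

$$\mathbf{q} \equiv \mathbf{p} - K\,\mathbf{u} \tag{11.2.12}$$

then we have

$$\mathbf{A}' = \mathbf{A} - \mathbf{q} \cdot \mathbf{u}^T - \mathbf{u} \cdot \mathbf{q}^T \tag{11.2.13}$$

This is the computationally useful formula.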
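
The chain p, K, q, A' of equations (11.2.10) through (11.2.13) can be sketched as below. The name householder_update, the fixed size N, double precision, and 0-based indexing are assumptions of this sketch. It updates the full matrix for clarity; it does not exploit symmetry or apply any scaling, so it is not a substitute for a production routine.

```c
#define N 3   /* illustrative size; an assumption of this sketch */

/* Overwrite the symmetric matrix a with P a P, where P = 1 - u u^T / H,
   using a' = a - q u^T - u q^T rather than explicit matrix products. */
void householder_update(double a[N][N], const double u[N])
{
    double p[N], q[N], H = 0.0, K = 0.0;
    int i, j;
    for (i = 0; i < N; i++) H += 0.5 * u[i] * u[i];     /* (11.2.4) */
    if (H == 0.0) return;                               /* u = 0: nothing to do */
    for (i = 0; i < N; i++) {                           /* p = A u / H, (11.2.10) */
        p[i] = 0.0;
        for (j = 0; j < N; j++) p[i] += a[i][j] * u[j];
        p[i] /= H;
        K += u[i] * p[i];
    }
    K /= 2.0 * H;                                       /* (11.2.11) */
    for (i = 0; i < N; i++) q[i] = p[i] - K * u[i];     /* (11.2.12) */
    for (i = 0; i < N; i++)                             /* (11.2.13) */
        for (j = 0; j < N; j++)
            a[i][j] -= q[i] * u[j] + u[i] * q[j];
}
```

As a check, take the matrix with rows (2, 3, 4), (3, 1, 0), (4, 0, 5). The lower part of its first column is x = (3, 4), so with the sign convention that avoids cancellation u = (0, 8, 4). After the update the (0,2) and (2,0) entries vanish, the (0,1) entry becomes -5, and the trace is still 8.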

Following [2], the routine for Householder reduction given below actually starts
in the $n$th column of $\mathbf{A}$, not the first as in the explanation above. In detail, the
equations are as follows: At stage $m$ ($m = 1, 2, \ldots, n - 2$) the vector $\mathbf{u}$ has the form

$$\mathbf{u}^T = [a_{i1}, a_{i2}, \ldots, a_{i,i-2}, a_{i,i-1} \pm \sqrt{\sigma}, 0, \ldots, 0] \tag{11.2.14}$$

Here

$$i \equiv n - m + 1 = n, n - 1, \ldots, 3 \tag{11.2.15}$$

and the quantity $\sigma$ ($|\mathbf{x}|^2$ in our earlier notation) is

$$\sigma = (a_{i1})^2 + \cdots + (a_{i,i-1})^2 \tag{11.2.16}$$

We choose the sign of $\sqrt{\sigma}$ in (11.2.14) to be the same as the sign of $a_{i,i-1}$ to lessen
roundoff error.

Variables are thus computed in the following order: $\sigma$, $\mathbf{u}$, $H$, $\mathbf{p}$, $K$, $\mathbf{q}$, $\mathbf{A}'$. At any
stage $m$, $\mathbf{A}$ is tridiagonal in its last $m - 1$ rows and columns.

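If the eigenvectors of the final tridiagonal matrix are found (for example, by the
routine in the next section), then the eigenvectors of $\mathbf{A}$ can be obtained by applying
the accumulated transformation

$$\mathbf{Q} = \mathbf{P}_1 \cdot \mathbf{P}_2 \cdots \mathbf{P}_{n-2} \tag{11.2.17}$$

to those eigenvectors. We therefore form $\mathbf{Q}$ by recursion after all the $\mathbf{P}$’s have
been determined:

$$\begin{aligned}
\mathbf{Q}_{n-2} &= \mathbf{P}_{n-2} \\
\mathbf{Q}_j &= \mathbf{P}_j \cdot \mathbf{Q}_{j+1}, \qquad j = n - 3, \ldots, 1 \\
\mathbf{Q} &= \mathbf{Q}_1
\end{aligned} \tag{11.2.18}$$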

Input for the routine below is the real, symmetric matrix a[1..n][1..n]. On
output, a contains the elements of the orthogonal matrix $\mathbf{Q}$. The vector d[1..n] is
set to the diagonal elements of the tridiagonal matrix $\mathbf{A}'$, while the vector e[1..n]
is set to the off-diagonal elements in its components 2 through n, with e[1]=0.
Note that since a is overwritten, you should copy it before calling the routine, if it 
is required for subsequent computations. 

No extra storage arrays are needed for the intermediate results. At stage $m$, the
vectors $\mathbf{p}$ and $\mathbf{q}$ are nonzero only in elements $1, \ldots, i$ (recall that $i = n - m + 1$),
while $\mathbf{u}$ is nonzero only in elements $1, \ldots, i - 1$. The elements of the vector e are
being determined in the order $n, n - 1, \ldots$, so we can store $\mathbf{p}$ in the elements of e
not already determined. The vector $\mathbf{q}$ can overwrite $\mathbf{p}$ once $\mathbf{p}$ is no longer needed.
We store $\mathbf{u}$ in the $i$th row of a and $\mathbf{u}/H$ in the $i$th column of a. Once the reduction
is complete, we compute the matrices $\mathbf{Q}_j$ using the quantities $\mathbf{u}$ and $\mathbf{u}/H$ that have
been stored in a. Since $\mathbf{Q}_j$ is an identity matrix in the last $n - j + 1$ rows and
columns, we only need compute its elements up to row and column $n - j$. These
can overwrite the $\mathbf{u}$’s and $\mathbf{u}/H$’s in the corresponding rows and columns of a, which
are no longer required for subsequent $\mathbf{Q}$’s.

The routine tred2, given below, includes one further refinement. If the quantity
$\sigma$ is zero or “small” at any stage, one can skip the corresponding transformation.
A simple criterion, such as

$$\sigma < \frac{\text{smallest positive number representable on machine}}{\text{machine precision}}$$

would be fine most of the time. A more careful criterion is actually used. Define
the quantity

$$\epsilon = \sum_{k=1}^{i-1} |a_{ik}| \tag{11.2.19}$$

If $\epsilon = 0$ to machine precision, we skip the transformation. Otherwise we redefine

$$a_{ik} \quad \text{becomes} \quad a_{ik}/\epsilon \tag{11.2.20}$$

and use the scaled variables for the transformation. (A Householder transformation
depends only on the ratios of the elements.)

Note that when dealing with a matrix whose elements vary over many orders of 
magnitude, it is important that the matrix be permuted, insofar as possible, so that 
the smaller elements are in the top left-hand corner. This is because the reduction 
is performed starting from the bottom right-hand corner, and a mixture of small and 
large elements there can lead to considerable rounding errors. 

The routine tred2 is designed for use with the routine tqli of the next section.
tqli finds the eigenvalues and eigenvectors of a symmetric, tridiagonal matrix.
The combination of tred2 and tqli is the most efficient known technique for 
finding all the eigenvalues and eigenvectors (or just all the eigenvalues) of a real, 
symmetric matrix. 

In the listing below, the statements indicated by comments are required only for 
subsequent computation of eigenvectors. If only eigenvalues are required, omission 
of the commented statements speeds up the execution time of tred2 by a factor of 2 
for large $n$. In the limit of large $n$, the operation count of the Householder reduction
is $2n^3/3$ for eigenvalues only, and $4n^3/3$ for both eigenvalues and eigenvectors.

#include <math.h>

void tred2(float **a, int n, float d[], float e[])
Householder reduction of a real, symmetric matrix a[1..n][1..n]. On output, a is replaced
by the orthogonal matrix Q effecting the transformation. d[1..n] returns the diagonal elements
of the tridiagonal matrix, and e[1..n] the off-diagonal elements, with e[1]=0. Several
statements, as noted in comments, can be omitted if only eigenvalues are to be found, in which
case a contains no useful information on output. Otherwise they are to be included.
{
    int l,k,j,i;
    float scale,hh,h,g,f;

    for (i=n;i>=2;i--) {
        l=i-1;
        h=scale=0.0;
        if (l > 1) {
            for (k=1;k<=l;k++)
                scale += fabs(a[i][k]);
            if (scale == 0.0)                   /* Skip transformation. */
                e[i]=a[i][l];
            else {
                for (k=1;k<=l;k++) {
                    a[i][k] /= scale;           /* Use scaled a's for transformation. */
                    h += a[i][k]*a[i][k];       /* Form sigma in h. */
                }
                f=a[i][l];
                g=(f >= 0.0 ? -sqrt(h) : sqrt(h));
                e[i]=scale*g;
                h -= f*g;                       /* Now h is equation (11.2.4). */
                a[i][l]=f-g;                    /* Store u in the ith row of a. */
                f=0.0;
                for (j=1;j<=l;j++) {
                    /* Next statement can be omitted if eigenvectors not wanted */
                    a[j][i]=a[i][j]/h;          /* Store u/H in ith column of a. */
                    g=0.0;                      /* Form an element of A.u in g. */
                    for (k=1;k<=j;k++)
                        g += a[j][k]*a[i][k];
                    for (k=j+1;k<=l;k++)
                        g += a[k][j]*a[i][k];
                    e[j]=g/h;                   /* Form element of p in temporarily unused
                                                   element of e. */
                    f += e[j]*a[i][j];
                }
                hh=f/(h+h);                     /* Form K, equation (11.2.11). */
                for (j=1;j<=l;j++) {            /* Form q and store in e overwriting p. */
                    f=a[i][j];
                    e[j]=g=e[j]-hh*f;
                    for (k=1;k<=j;k++)          /* Reduce a, equation (11.2.13). */
                        a[j][k] -= (f*e[k]+g*a[i][k]);
                }
            }
        } else
            e[i]=a[i][l];
        d[i]=h;
    }
    /* Next statement can be omitted if eigenvectors not wanted */
    d[1]=0.0;
    e[1]=0.0;
    /* Contents of this loop can be omitted if eigenvectors not
       wanted except for statement d[i]=a[i][i]; */
    for (i=1;i<=n;i++) {                        /* Begin accumulation of transformation matrices. */
        l=i-1;
        if (d[i]) {                             /* This block skipped when i=1. */
            for (j=1;j<=l;j++) {
                g=0.0;
                for (k=1;k<=l;k++)              /* Use u and u/H stored in a to form P.Q. */
                    g += a[i][k]*a[k][j];
                for (k=1;k<=l;k++)
                    a[k][j] -= g*a[k][i];
            }
        }
        d[i]=a[i][i];                           /* This statement remains. */
        a[i][i]=1.0;                            /* Reset row and column of a to identity */
        for (j=1;j<=l;j++) a[j][i]=a[i][j]=0.0; /* matrix for next iteration. */
    }
}

CITED REFERENCES AND FURTHER READING:

Golub, G.H., and Van Loan, C.F. 1989, Matrix Computations, 2nd ed. (Baltimore: Johns Hopkins
University Press), §5.1. [1]

Smith, B.T., et al. 1976, Matrix Eigensystem Routines — EISPACK Guide, 2nd ed., vol. 6 of
Lecture Notes in Computer Science (New York: Springer-Verlag).

Wilkinson, J.H., and Reinsch, C. 1971, Linear Algebra, vol. II of Handbook for Automatic
Computation (New York: Springer-Verlag). [2]



11.3 Eigenvalues and Eigenvectors of a 
Tridiagonal Matrix 

Evaluation of the Characteristic Polynomial 



Once our original, real, symmetric matrix has been reduced to tridiagonal form,
one possible way to determine its eigenvalues is to find the roots of the characteristic
polynomial $p_n(\lambda)$ directly. The characteristic polynomial of a tridiagonal matrix can
be evaluated for any trial value of $\lambda$ by an efficient recursion relation (see [1], for
example). The polynomials of lower degree produced during the recurrence form a
Sturmian sequence that can be used to localize the eigenvalues to intervals on the
real axis. A root-finding method such as bisection or Newton’s method can then
be employed to refine the intervals. The corresponding eigenvectors can then be
found by inverse iteration (see §11.7).
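
A hedged sketch of this idea follows. Instead of the polynomials themselves it carries the ratio of successive polynomials, $q_i = p_i(\lambda)/p_{i-1}(\lambda)$, which counts the same sign changes but is less prone to overflow; that substitution, the names sturm_count and bisect_eigenvalue, the 0-based storage (d[0..n-1] for the diagonal, e[0..n-2] for the off-diagonal), and the handling of an exact zero are all assumptions of the sketch, not taken from the procedures in [2,3].

```c
#include <float.h>

/* Count the eigenvalues smaller than lambda of the symmetric tridiagonal
   matrix with diagonal d[0..n-1] and off-diagonal e[0..n-2].  The count
   equals the number of negative terms in the ratio recurrence
   q_i = (d_i - lambda) - e_{i-1}^2 / q_{i-1}. */
int sturm_count(const double *d, const double *e, int n, double lambda)
{
    int i, count = 0;
    double q = 1.0;
    for (i = 0; i < n; i++) {
        double off2 = (i > 0) ? e[i - 1] * e[i - 1] : 0.0;
        q = (d[i] - lambda) - off2 / q;
        if (q == 0.0) q = -DBL_EPSILON;  /* nudge an exact zero; a convention of this sketch */
        if (q < 0.0) count++;
    }
    return count;
}

/* Bisection for the k-th smallest eigenvalue (k = 0 is the smallest),
   assuming it is known to lie in the interval [lo, hi]. */
double bisect_eigenvalue(const double *d, const double *e, int n, int k,
                         double lo, double hi)
{
    int it;
    for (it = 0; it < 100; it++) {
        double mid = 0.5 * (lo + hi);
        if (sturm_count(d, e, n, mid) > k) hi = mid;
        else lo = mid;
    }
    return 0.5 * (lo + hi);
}
```

For the 2 x 2 matrix with d = (2, 2) and e = (1), whose eigenvalues are 1 and 3, bisect_eigenvalue(d, e, 2, 0, 0.0, 4.0) should converge to 1.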

Procedures based on these ideas can be found in [2,3]. If, however, more 
than a small fraction of all the eigenvalues and eigenvectors are required, then the 
factorization method next considered is much more efficient. 

The QR and QL Algorithms 

The basic idea behind the QR algorithm is that any real matrix can be
decomposed in the form

$$\mathbf{A} = \mathbf{Q} \cdot \mathbf{R} \tag{11.3.1}$$

where $\mathbf{Q}$ is orthogonal and $\mathbf{R}$ is upper triangular. For a general matrix, the
decomposition is constructed by applying Householder transformations to annihilate
successive columns of $\mathbf{A}$ below the diagonal (see §2.10).

Now consider the matrix formed by writing the factors in (11.3.1) in the
opposite order:

$$\mathbf{A}' = \mathbf{R} \cdot \mathbf{Q} \tag{11.3.2}$$

Since $\mathbf{Q}$ is orthogonal, equation (11.3.1) gives $\mathbf{R} = \mathbf{Q}^T \cdot \mathbf{A}$. Thus equation (11.3.2)
becomes

$$\mathbf{A}' = \mathbf{Q}^T \cdot \mathbf{A} \cdot \mathbf{Q} \tag{11.3.3}$$

We see that $\mathbf{A}'$ is an orthogonal transformation of $\mathbf{A}$.

You can verify that a QR transformation preserves the following properties of
a matrix: symmetry, tridiagonal form, and Hessenberg form (to be defined in §11.5).

There is nothing special about choosing one of the factors of A to be upper 
triangular; one could equally well make it lower triangular. This is called the QL 
algorithm, since 

$$\mathbf{A} = \mathbf{Q} \cdot \mathbf{L} \tag{11.3.4}$$

where L is lower triangular. (The standard, but confusing, nomenclature R and L 
stands for whether the right or left of the matrix is nonzero.) 

Recall that in the Householder reduction to tridiagonal form in §11.2, we started
in the nth (last) column of the original matrix. To minimize roundoff, we then 
exhorted you to put the biggest elements of the matrix in the lower right-hand 
corner, if you can. If we now wish to diagonalize the resulting tridiagonal matrix, 
the QL algorithm will have smaller roundoff than the QR algorithm, so we shall 
use QL henceforth.

The QL algorithm consists of a sequence of orthogonal transformations:

$$\begin{aligned}
\mathbf{A}_s &= \mathbf{Q}_s \cdot \mathbf{L}_s \\
\mathbf{A}_{s+1} &= \mathbf{L}_s \cdot \mathbf{Q}_s \qquad (= \mathbf{Q}_s^T \cdot \mathbf{A}_s \cdot \mathbf{Q}_s)
\end{aligned} \tag{11.3.5}$$

The following (nonobvious!) theorem is the basis of the algorithm for a general
matrix $\mathbf{A}$: (i) If $\mathbf{A}$ has eigenvalues of different absolute value $|\lambda_i|$, then $\mathbf{A}_s \to$ [lower
triangular form] as $s \to \infty$. The eigenvalues appear on the diagonal in increasing
order of absolute magnitude. (ii) If $\mathbf{A}$ has an eigenvalue $|\lambda_i|$ of multiplicity $p$,
$\mathbf{A}_s \to$ [lower triangular form] as $s \to \infty$, except for a diagonal block matrix
of order $p$, whose eigenvalues $\to \lambda_i$. The proof of this theorem is fairly lengthy;
see, for example, [4].

The workload in the QL algorithm is $O(n^3)$ per iteration for a general matrix,
which is prohibitive. However, the workload is only $O(n)$ per iteration for a
tridiagonal matrix and $O(n^2)$ for a Hessenberg matrix, which makes it highly
efficient on these forms.

In this section we are concerned only with the case where $\mathbf{A}$ is a real, symmetric,
tridiagonal matrix. All the eigenvalues $\lambda_i$ are thus real. According to the theorem,
if any $\lambda_i$ has a multiplicity $p$, then there must be at least $p - 1$ zeros on the
sub- and superdiagonal. Thus the matrix can be split into submatrices that can be
diagonalized separately, and the complication of diagonal blocks that can arise in
the general case is irrelevant.

In the proof of the theorem quoted above, one finds that in general a superdiagonal
element converges to zero like

$$a_{ij}^{(s)} \sim \left(\frac{\lambda_i}{\lambda_j}\right)^s \tag{11.3.6}$$

Although $\lambda_i < \lambda_j$, convergence can be slow if $\lambda_i$ is close to $\lambda_j$. Convergence can
be accelerated by the technique of shifting: If $k$ is any constant, then $\mathbf{A} - k\mathbf{1}$ has
eigenvalues $\lambda_i - k$. If we decompose

$$\mathbf{A}_s - k_s \mathbf{1} = \mathbf{Q}_s \cdot \mathbf{L}_s \tag{11.3.7}$$

so that

$$\begin{aligned}
\mathbf{A}_{s+1} &= \mathbf{L}_s \cdot \mathbf{Q}_s + k_s \mathbf{1} \\
&= \mathbf{Q}_s^T \cdot \mathbf{A}_s \cdot \mathbf{Q}_s
\end{aligned} \tag{11.3.8}$$

then the convergence is determined by the ratio

$$\frac{\lambda_i - k_s}{\lambda_j - k_s} \tag{11.3.9}$$

The idea is to choose the shift $k_s$ at each stage to maximize the rate of
convergence. A good choice for the shift initially would be $k_s$ close to $\lambda_1$, the
smallest eigenvalue. Then the first row of off-diagonal elements would tend rapidly
to zero. However, $\lambda_1$ is not usually known a priori. A very effective strategy in
practice (although there is no proof that it is optimal) is to compute the eigenvalues
of the leading $2 \times 2$ diagonal submatrix of $\mathbf{A}$. Then set $k_s$ equal to the eigenvalue
closer to $a_{11}$.

More generally, suppose you have already found $r - 1$ eigenvalues of $\mathbf{A}$. Then
you can deflate the matrix by crossing out the first $r - 1$ rows and columns, leaving

$$\begin{bmatrix}
0 & \cdots & & & & 0 \\
\vdots & 0 & & & & \vdots \\
 & & d_r & e_r & & \\
 & & e_r & d_{r+1} & & \\
 & & & \ddots & & \\
 & & & & d_{n-1} & e_{n-1} \\
0 & \cdots & & 0 & e_{n-1} & d_n
\end{bmatrix} \tag{11.3.10}$$

Choose $k_s$ equal to the eigenvalue of the leading $2 \times 2$ submatrix that is closer to $d_r$.
One can show that the convergence of the algorithm with this strategy is generally
cubic (and at worst quadratic for degenerate eigenvalues). This rapid convergence
is what makes the algorithm so attractive.

Note that with shifting, the eigenvalues no longer necessarily appear on the 
diagonal in order of increasing absolute magnitude. The routine eigsrt (§11.1) 
can be used if required. 

As we mentioned earlier, the QL decomposition of a general matrix is effected
by a sequence of Householder transformations. For a tridiagonal matrix, however, it is
more efficient to use plane rotations $\mathbf{P}_{pq}$. One uses the sequence $\mathbf{P}_{12}, \mathbf{P}_{23}, \ldots, \mathbf{P}_{n-1,n}$
to annihilate the elements $a_{12}, a_{23}, \ldots, a_{n-1,n}$. By symmetry, the subdiagonal
elements $a_{21}, a_{32}, \ldots, a_{n,n-1}$ will be annihilated too. Thus each $\mathbf{Q}_s$ is a product
of plane rotations:

$$\mathbf{Q}_s^T = \mathbf{P}_1^{(s)} \cdot \mathbf{P}_2^{(s)} \cdots \mathbf{P}_{n-1}^{(s)} \tag{11.3.11}$$

where $\mathbf{P}_i$ annihilates $a_{i,i+1}$. Note that it is $\mathbf{Q}^T$ in equation (11.3.11), not $\mathbf{Q}$, because
we defined $\mathbf{L} = \mathbf{Q}^T \cdot \mathbf{A}$.

QL Algorithm with Implicit Shifts 

The algorithm as described so far can be very successful. However, when
the elements of $\mathbf{A}$ differ widely in order of magnitude, subtracting a large $k_s$
from the diagonal elements can lead to loss of accuracy for the small eigenvalues.
This difficulty is avoided by the QL algorithm with implicit shifts. The implicit
QL algorithm is mathematically equivalent to the original QL algorithm, but the
computation does not require $k_s \mathbf{1}$ to be actually subtracted from $\mathbf{A}$.

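The algorithm is based on the following lemma: If $\mathbf{A}$ is a symmetric nonsingular matrix
and $\mathbf{B} = \mathbf{Q}^T \cdot \mathbf{A} \cdot \mathbf{Q}$, where $\mathbf{Q}$ is orthogonal and $\mathbf{B}$ is tridiagonal with positive off-diagonal
elements, then $\mathbf{Q}$ and $\mathbf{B}$ are fully determined when the last row of $\mathbf{Q}^T$ is specified. Proof:
Let $\mathbf{q}_i^T$ denote the $i$th row vector of the matrix $\mathbf{Q}^T$. Then $\mathbf{q}_i$ is the $i$th column vector of the
matrix $\mathbf{Q}$. The relation $\mathbf{B} \cdot \mathbf{Q}^T = \mathbf{Q}^T \cdot \mathbf{A}$ can be written

$$\begin{bmatrix}
\beta_1 & \gamma_1 & & & \\
\alpha_2 & \beta_2 & \gamma_2 & & \\
 & & \ddots & & \\
 & & \alpha_{n-1} & \beta_{n-1} & \gamma_{n-1} \\
 & & & \alpha_n & \beta_n
\end{bmatrix}
\cdot
\begin{bmatrix} \mathbf{q}_1^T \\ \mathbf{q}_2^T \\ \vdots \\ \mathbf{q}_{n-1}^T \\ \mathbf{q}_n^T \end{bmatrix}
=
\begin{bmatrix} \mathbf{q}_1^T \\ \mathbf{q}_2^T \\ \vdots \\ \mathbf{q}_{n-1}^T \\ \mathbf{q}_n^T \end{bmatrix}
\cdot \mathbf{A} \tag{11.3.12}$$

The $n$th row of this matrix equation is

$$\alpha_n \mathbf{q}_{n-1}^T + \beta_n \mathbf{q}_n^T = \mathbf{q}_n^T \cdot \mathbf{A} \tag{11.3.13}$$

Since $\mathbf{Q}$ is orthogonal,

$$\mathbf{q}_n^T \cdot \mathbf{q}_m = \delta_{nm} \tag{11.3.14}$$

Thus if we postmultiply equation (11.3.13) by $\mathbf{q}_n$, we find

$$\beta_n = \mathbf{q}_n^T \cdot \mathbf{A} \cdot \mathbf{q}_n \tag{11.3.15}$$

which is known since $\mathbf{q}_n$ is known. Then equation (11.3.13) gives

$$\alpha_n \mathbf{q}_{n-1}^T = \mathbf{z}_{n-1}^T \tag{11.3.16}$$

where

$$\mathbf{z}_{n-1}^T \equiv \mathbf{q}_n^T \cdot \mathbf{A} - \beta_n \mathbf{q}_n^T \tag{11.3.17}$$

is known. Therefore

$$\alpha_n^2 = \mathbf{z}_{n-1}^T \mathbf{z}_{n-1} \tag{11.3.18}$$

or

$$\alpha_n = |\mathbf{z}_{n-1}| \tag{11.3.19}$$

and

$$\mathbf{q}_{n-1}^T = \mathbf{z}_{n-1}^T / \alpha_n \tag{11.3.20}$$

(where $\alpha_n$ is nonzero by hypothesis). Similarly, one can show by induction that if we know
$\mathbf{q}_n, \mathbf{q}_{n-1}, \ldots, \mathbf{q}_{n-j}$ and the $\alpha$’s, $\beta$’s, and $\gamma$’s up to level $n - j$, one can determine the
quantities at level $n - (j + 1)$.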

To apply the lemma in practice, suppose one can somehow find a tridiagonal matrix
$\bar{\mathbf{A}}_{s+1}$ such that

$$\bar{\mathbf{A}}_{s+1} = \bar{\mathbf{Q}}_s^T \cdot \mathbf{A}_s \cdot \bar{\mathbf{Q}}_s \tag{11.3.21}$$

where $\bar{\mathbf{Q}}_s^T$ is orthogonal and has the same last row as $\mathbf{Q}_s^T$ in the original QL algorithm.
Then $\bar{\mathbf{Q}}_s = \mathbf{Q}_s$ and $\bar{\mathbf{A}}_{s+1} = \mathbf{A}_{s+1}$.

Now, in the original algorithm, from equation (11.3.11) we see that the last row of $\mathbf{Q}_s^T$
is the same as the last row of $\mathbf{P}_{n-1}^{(s)}$. But recall that $\mathbf{P}_{n-1}^{(s)}$ is a plane rotation designed to
annihilate the $(n - 1, n)$ element of $\mathbf{A}_s - k_s \mathbf{1}$. A simple calculation using the expression
(11.1.1) shows that it has parameters

$$c = \frac{d_n - k_s}{\sqrt{e_{n-1}^2 + (d_n - k_s)^2}}, \qquad s = \frac{-e_{n-1}}{\sqrt{e_{n-1}^2 + (d_n - k_s)^2}} \tag{11.3.22}$$

The matrix $\mathbf{P}_{n-1}^{(s)} \cdot \mathbf{A}_s \cdot \mathbf{P}_{n-1}^{(s)T}$ is tridiagonal with 2 extra elements:

$$\begin{bmatrix}
\ddots & & & & \\
\times & \times & \times & & \\
 & \times & \times & \times & x \\
 & & \times & \times & \times \\
 & & x & \times & \times
\end{bmatrix} \tag{11.3.23}$$

We must now reduce this to tridiagonal form with an orthogonal matrix whose last row is
$[0, 0, \ldots, 0, 1]$ so that the last row of $\bar{\mathbf{Q}}_s^T$ will stay equal to that of $\mathbf{P}_{n-1}^{(s)}$. This can be done by
a sequence of Householder or Givens transformations. For the special form of the matrix
(11.3.23), Givens is better. We rotate in the plane $(n - 2, n - 1)$ to annihilate the $(n - 2, n)$
element. [By symmetry, the $(n, n - 2)$ element will also be zeroed.] This leaves us with
tridiagonal form except for extra elements $(n - 3, n - 1)$ and $(n - 1, n - 3)$. We annihilate
these with a rotation in the $(n - 3, n - 2)$ plane, and so on. Thus a sequence of $n - 2$
Givens rotations is required. The result is that

$$\bar{\mathbf{Q}}_s^T = \mathbf{Q}_s^T = \bar{\mathbf{P}}_1^{(s)} \cdot \bar{\mathbf{P}}_2^{(s)} \cdots \bar{\mathbf{P}}_{n-2}^{(s)} \cdot \mathbf{P}_{n-1}^{(s)} \tag{11.3.24}$$

where the $\bar{\mathbf{P}}$’s are the Givens rotations and $\mathbf{P}_{n-1}$ is the same plane rotation as in the original
algorithm. Then equation (11.3.21) gives the next iterate of $\mathbf{A}$. Note that the shift $k_s$ enters
implicitly through the parameters (11.3.22).

The following routine tqli (“Tridiagonal QL Implicit”), based algorithmically
on the implementations in [2,3], works extremely well in practice. The number of
iterations for the first few eigenvalues might be 4 or 5, say, but meanwhile the
off-diagonal elements in the lower right-hand corner have been reduced too. The
later eigenvalues are liberated with very little work. The average number of iterations
per eigenvalue is typically 1.3 to 1.6. The operation count per iteration is $O(n)$,
with a fairly large effective coefficient, say, $\sim 20n$. The total operation count for
the diagonalization is then $\sim 20n \times (1.3\text{ to }1.6)\,n \sim 30n^2$. If the eigenvectors
are required, the statements indicated by comments are included and there is an
additional, much larger, workload of about $3n^3$ operations.

#include <math.h>
#include "nrutil.h"

void tqli(float d[], float e[], int n, float **z)
QL algorithm with implicit shifts, to determine the eigenvalues and eigenvectors of a real,
symmetric, tridiagonal matrix, or of a real, symmetric matrix previously reduced by tred2 (§11.2). On
input, d[1..n] contains the diagonal elements of the tridiagonal matrix. On output, it returns
the eigenvalues. The vector e[1..n] inputs the subdiagonal elements of the tridiagonal matrix,
with e[1] arbitrary. On output e is destroyed. When finding only the eigenvalues, several lines
may be omitted, as noted in the comments. If the eigenvectors of a tridiagonal matrix are
desired, the matrix z[1..n][1..n] is input as the identity matrix. If the eigenvectors of a matrix
that has been reduced by tred2 are required, then z is input as the matrix output by tred2.
In either case, the kth column of z returns the normalized eigenvector corresponding to d[k].
{
    float pythag(float a, float b);
    int m,l,iter,i,k;
    float s,r,p,g,f,dd,c,b;

    for (i=2;i<=n;i++) e[i-1]=e[i];         /* Convenient to renumber the elements of e. */
    e[n]=0.0;
    for (l=1;l<=n;l++) {
        iter=0;
        do {
            for (m=l;m<=n-1;m++) {          /* Look for a single small subdiagonal
                                               element to split the matrix. */
                dd=fabs(d[m])+fabs(d[m+1]);
                if ((float)(fabs(e[m])+dd) == dd) break;
            }
            if (m != l) {
                if (iter++ == 30) nrerror("Too many iterations in tqli");
                g=(d[l+1]-d[l])/(2.0*e[l]); /* Form shift. */
                r=pythag(g,1.0);
                g=d[m]-d[l]+e[l]/(g+SIGN(r,g));     /* This is d_m - k_s. */
                s=c=1.0;
                p=0.0;
                for (i=m-1;i>=l;i--) {      /* A plane rotation as in the original QL,
                                               followed by Givens rotations to restore
                                               tridiagonal form. */
                    f=s*e[i];
                    b=c*e[i];
                    e[i+1]=(r=pythag(f,g));
                    if (r == 0.0) {         /* Recover from underflow. */
                        d[i+1] -= p;
                        e[m]=0.0;
                        break;
                    }
                    s=f/r;
                    c=g/r;
                    g=d[i+1]-p;
                    r=(d[i]-g)*s+2.0*c*b;
                    d[i+1]=g+(p=s*r);
                    g=c*r-b;
                    /* Next loop can be omitted if eigenvectors not wanted */
                    for (k=1;k<=n;k++) {    /* Form eigenvectors. */
                        f=z[k][i+1];
                        z[k][i+1]=s*z[k][i]+c*f;
                        z[k][i]=c*z[k][i]-s*f;
                    }
                }
                if (r == 0.0 && i >= l) continue;
                d[l] -= p;
                e[l]=g;
                e[m]=0.0;
            }
        } while (m != l);
    }
}

CITED REFERENCES AND FURTHER READING:

Wilkinson, J.H., and Reinsch, C. 1971, Linear Algebra, vol. II of Handbook for Automatic
Computation (New York: Springer-Verlag). [2]

Smith, B.T., et al. 1976, Matrix Eigensystem Routines — EISPACK Guide, 2nd ed., vol. 6 of 
Lecture Notes in Computer Science (New York: Springer-Verlag). [3] 

Stoer, J., and Bulirsch, R. 1980, Introduction to Numerical Analysis (New York: Springer-Verlag), 
§6.6.6. [4] 


11.4 Hermitian Matrices 

The complex analog of a real, symmetric matrix is a Hermitian matrix, 
satisfying equation (11.0.4). Jacobi transformations can be used to find eigenvalues 
and eigenvectors, as also can Householder reduction to tridiagonal form followed by 
QL iteration. Complex versions of the previous routines jacobi, tred2, and tqli 
are quite analogous to their real counterparts. For working routines, consult [1,2]. 

An alternative, using the routines in this book, is to convert the Hermitian
problem to a real, symmetric one: If $\mathbf{C} = \mathbf{A} + i\mathbf{B}$ is a Hermitian matrix, then the
$n \times n$ complex eigenvalue problem

$$(\mathbf{A} + i\mathbf{B}) \cdot (\mathbf{u} + i\mathbf{v}) = \lambda(\mathbf{u} + i\mathbf{v}) \tag{11.4.1}$$

is equivalent to the $2n \times 2n$ real problem

$$\begin{bmatrix} \mathbf{A} & -\mathbf{B} \\ \mathbf{B} & \mathbf{A} \end{bmatrix} \cdot \begin{bmatrix} \mathbf{u} \\ \mathbf{v} \end{bmatrix} = \lambda \begin{bmatrix} \mathbf{u} \\ \mathbf{v} \end{bmatrix} \tag{11.4.2}$$

Note that the $2n \times 2n$ matrix in (11.4.2) is symmetric: $\mathbf{A}^T = \mathbf{A}$ and $\mathbf{B}^T = -\mathbf{B}$
if $\mathbf{C}$ is Hermitian.

Corresponding to a given eigenvalue $\lambda$, the vector

$$\begin{bmatrix} -\mathbf{v} \\ \mathbf{u} \end{bmatrix} \tag{11.4.3}$$

is also an eigenvector, as you can verify by writing out the two matrix equations
implied by (11.4.2). Thus if $\lambda_1, \lambda_2, \ldots, \lambda_n$ are the eigenvalues of $\mathbf{C}$, then
the $2n$ eigenvalues of the augmented problem (11.4.2) are $\lambda_1, \lambda_1, \lambda_2, \lambda_2, \ldots,
\lambda_n, \lambda_n$; each, in other words, is repeated twice. The eigenvectors are pairs of the
form $\mathbf{u} + i\mathbf{v}$ and $i(\mathbf{u} + i\mathbf{v})$; that is, they are the same up to an inessential phase. Thus
we solve the augmented problem (11.4.2), and choose one eigenvalue and eigenvector
from each pair. These give the eigenvalues and eigenvectors of the original matrix $\mathbf{C}$.
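
Building the augmented matrix of (11.4.2) is mechanical, as this sketch shows. The function name augment_hermitian, the fixed size N, double precision, and 0-based indexing are assumptions of the sketch; the real and imaginary parts of C are taken to be supplied separately as a and b.

```c
#define N 2   /* illustrative size; an assumption of this sketch */

/* Fill the 2N x 2N real matrix m with the blocks [[A, -B], [B, A]]
   of equation (11.4.2), where C = A + iB is Hermitian. */
void augment_hermitian(double a[N][N], double b[N][N], double m[2 * N][2 * N])
{
    int i, j;
    for (i = 0; i < N; i++) {
        for (j = 0; j < N; j++) {
            m[i][j]         =  a[i][j];
            m[i][j + N]     = -b[i][j];
            m[i + N][j]     =  b[i][j];
            m[i + N][j + N] =  a[i][j];
        }
    }
}
```

Because A is symmetric and B is antisymmetric, the result is symmetric and can be handed to a real symmetric eigensolver. For C with rows (2, 1 - i) and (1 + i, 3), the parts are A with rows (2, 1), (1, 3) and B with rows (0, -1), (1, 0).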

Working with the augmented matrix requires a factor of 2 more storage than the 
original complex matrix. In principle, a complex algorithm is also a factor of 2 more 
efficient in computer time than is the solution of the augmented problem. 


CITED REFERENCES AND FURTHER READING: 

Wilkinson, J.H., and Reinsch, C. 1971, Linear Algebra, vol. II of Handbook for Automatic
Computation (New York: Springer-Verlag). [1]

Smith, B.T., et al. 1976, Matrix Eigensystem Routines — EISPACK Guide, 2nd ed., vol. 6 of 
Lecture Notes in Computer Science (New York: Springer-Verlag). [2] 


11.5 Reduction of a General Matrix to 
Hessenberg Form 

The algorithms for symmetric matrices, given in the preceding sections, are 
highly satisfactory in practice. By contrast, it is impossible to design equally 
satisfactory algorithms for the nonsymmetric case. There are two reasons for this. 
First, the eigenvalues of a nonsymmetric matrix can be very sensitive to small changes 
in the matrix elements. Second, the matrix itself can be defective, so that there is 
no complete set of eigenvectors. We emphasize that these difficulties are intrinsic 
properties of certain nonsymmetric matrices, and no numerical procedure can “cure” 
them. The best we can hope for are procedures that don’t exacerbate such problems. 

The presence of rounding error can only make the situation worse. With finite- 
precision arithmetic, one cannot even design a foolproof algorithm to determine 
whether a given matrix is defective or not. Thus current algorithms generally try to 
find a complete set of eigenvectors, and rely on the user to inspect the results. If any 
eigenvectors are almost parallel, the matrix is probably defective.

Apart from referring you to the literature, and to the collected routines in [1,2], we
are going to sidestep the problem of eigenvectors, giving algorithms for eigenvalues 
only. If you require just a few eigenvectors, you can read §11.7 and consider finding 
them by inverse iteration. We consider the problem of finding all eigenvectors of a 
nonsymmetric matrix as lying beyond the scope of this book. 

Balancing 

The sensitivity of eigenvalues to rounding errors during the execution of 
some algorithms can be reduced by the procedure of balancing. The errors in 
the eigensystem found by a numerical procedure are generally proportional to the 
Euclidean norm of the matrix, that is, to the square root of the sum of the squares 
of the elements. The idea of balancing is to use similarity transformations to 
make corresponding rows and columns of the matrix have comparable norms, thus 
reducing the overall norm of the matrix while leaving the eigenvalues unchanged. 
A symmetric matrix is already balanced. 

Balancing is a procedure requiring of order $N^2$ operations. Thus, the time taken 
by the procedure balanc, given below, should never be more than a few percent 
of the total time required to find the eigenvalues. It is therefore recommended that 
you always balance nonsymmetric matrices. It never hurts, and it can substantially 
improve the accuracy of the eigenvalues computed for a badly balanced matrix. 

The actual algorithm used is due to Osborne, as discussed in [1]. It consists of a 
sequence of similarity transformations by diagonal matrices D. To avoid introducing 
rounding errors during the balancing process, the elements of D are restricted to be 
exact powers of the radix base employed for floating-point arithmetic (i.e., 2 for most 
machines, but 16 for IBM mainframe architectures). The output is a matrix that 
is balanced in the norm given by summing the absolute magnitudes of the matrix 
elements. This is more efficient than using the Euclidean norm, and equally effective: 
A large reduction in one norm implies a large reduction in the other. 
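
The choice of a scale factor restricted to exact powers of the radix can be sketched in isolation. The helper below is a hypothetical illustration (the name `balance_factor` and the use of `double` are assumptions, not part of the routines in this book): given the off-diagonal column sum `c` and row sum `r` for one index, it returns the power of the radix `f` for which scaling the column by `f` and the row by `1/f` brings the two sums within a factor of the radix of each other.

```c
/* Hypothetical helper: power-of-radix factor f that roughly equalizes
   the scaled column sum c*f and the scaled row sum r/f. */
double balance_factor(double c, double r, double radix)
{
    double f = 1.0;
    double sqrdx = radix * radix;

    if (c == 0.0 || r == 0.0) return 1.0;   /* nothing to balance */
    /* c tracks c_original * f^2, i.e. the ratio (c*f)/(r/f) times r. */
    while (c < r / radix) { f *= radix; c *= sqrdx; }
    while (c > r * radix) { f /= radix; c /= sqrdx; }
    return f;
}
```

Because `f` is built only by multiplying or dividing by the radix, applying it to matrix elements changes exponents but not mantissas, so no rounding error is introduced.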

Note that if the off-diagonal elements of any row or column of a matrix are 
all zero, then the diagonal element is an eigenvalue. If the eigenvalue happens to 
be ill-conditioned (sensitive to small changes in the matrix elements), it will have 
relatively large errors when determined by the routine hqr (§11.6). Had we merely 
inspected the matrix beforehand, we could have determined the isolated eigenvalue 
exactly and then deleted the corresponding row and column from the matrix. You 
should consider whether such a pre-inspection might be useful in your application. 
(For symmetric matrices, the routines we gave will determine isolated eigenvalues 
accurately in all cases.) 
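
A pre-inspection of this kind might look like the following minimal sketch. The function name `find_isolated`, the 0-based row-major storage, and the use of `double` are assumptions made for illustration; the routines in this chapter use 1-based `float **` matrices instead.

```c
/* Hypothetical helper: scans an n x n row-major matrix for indices i whose
   row or column has all off-diagonal elements zero. For each such index,
   a[i][i] is an eigenvalue. Indices are stored in idx[] (capacity n);
   the return value is the number found. */
int find_isolated(int n, const double *a, int *idx)
{
    int count = 0;
    for (int i = 0; i < n; i++) {
        int row_zero = 1, col_zero = 1;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;
            if (a[i * n + j] != 0.0) row_zero = 0;
            if (a[j * n + i] != 0.0) col_zero = 0;
        }
        if (row_zero || col_zero) idx[count++] = i;
    }
    return count;
}
```

Each reported index identifies a diagonal element that can be read off exactly, after which the corresponding row and column can be deleted before calling the eigenvalue routine.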

The routine balanc does not keep track of the accumulated similarity transformation 
of the original matrix, since we will only be concerned with finding 
eigenvalues of nonsymmetric matrices, not eigenvectors. Consult [1-3] if you want 
to keep track of the transformation. 

#include <math.h> 

#define RADIX 2.0 

void balanc(float **a, int n) 
/* Given a matrix a[1..n][1..n], this routine replaces it by a balanced matrix with identical 
eigenvalues. A symmetric matrix is already balanced and is unaffected by this procedure. The 
parameter RADIX should be the machine's floating-point radix. */ 
{ 
	int last,j,i; 
	float s,r,g,f,c,sqrdx; 

	sqrdx=RADIX*RADIX; 
	last=0; 
	while (last == 0) { 
		last=1; 
		for (i=1;i<=n;i++) {              /* Calculate row and column norms. */ 
			r=c=0.0; 
			for (j=1;j<=n;j++) 
				if (j != i) { 
					c += fabs(a[j][i]); 
					r += fabs(a[i][j]); 
				} 
			if (c && r) {                 /* If both are nonzero, */ 
				g=r/RADIX; 
				f=1.0; 
				s=c+r; 
				while (c<g) {             /* find the integer power of the machine radix that */ 
					f *= RADIX;           /* comes closest to balancing the matrix. */ 
					c *= sqrdx; 
				} 
				g=r*RADIX; 
				while (c>g) { 
					f /= RADIX; 
					c /= sqrdx; 
				} 
				if ((c+r)/f < 0.95*s) { 
					last=0; 
					g=1.0/f; 
					for (j=1;j<=n;j++) a[i][j] *= g;   /* Apply similarity transformation. */ 
					for (j=1;j<=n;j++) a[j][i] *= f; 
				} 
			} 
		} 
	} 
} 



Reduction to Hessenberg Form 

The strategy for finding the eigensystem of a general matrix parallels that of the 
symmetric case. First we reduce the matrix to a simpler form, and then we perform 
an iterative procedure on the simplified matrix. The simpler structure we use here is 
called Hessenberg form. An upper Hessenberg matrix has zeros everywhere below 
the diagonal except for the first subdiagonal row. For example, in the 6 x 6 case, 
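the nonzero elements are: 

$$
\begin{bmatrix}
\times & \times & \times & \times & \times & \times \\
\times & \times & \times & \times & \times & \times \\
       & \times & \times & \times & \times & \times \\
       &        & \times & \times & \times & \times \\
       &        &        & \times & \times & \times \\
       &        &        &        & \times & \times
\end{bmatrix}
$$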
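
The structural definition can be stated as a short test. The sketch below is illustrative only; the name `is_upper_hessenberg` and the 0-based row-major storage are assumptions, not conventions used by the routines in this chapter.

```c
/* Hypothetical helper: returns 1 if the n x n row-major matrix a is upper
   Hessenberg, i.e. a[i][j] == 0 whenever i > j + 1 (0-based), else 0. */
int is_upper_hessenberg(int n, const double *a)
{
    for (int i = 2; i < n; i++)
        for (int j = 0; j < i - 1; j++)
            if (a[i * n + j] != 0.0) return 0;
    return 1;
}
```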

By now you should be able to tell at a glance that such a structure can 
be achieved by a sequence of Householder transformations, each one zeroing the 
required elements in a column of the matrix. Householder reduction to Hessenberg 
form is in fact an accepted technique. An alternative, however, is a procedure 
analogous to Gaussian elimination with pivoting. We will use this elimination 
procedure since it is about a factor of 2 more efficient than the Householder method, 
and also since we want to teach you the method. It is possible to construct matrices 
for which the Householder reduction, being orthogonal, is stable and elimination is 
not, but such matrices are extremely rare in practice. 

Straight Gaussian elimination is not a similarity transformation of the matrix. 
Accordingly, the actual elimination procedure used is slightly different. Before the 
$r$th stage, the original matrix $\mathbf{A} \equiv \mathbf{A}_1$ has become $\mathbf{A}_r$, which is upper Hessenberg 
in its first $r - 1$ rows and columns. The $r$th stage then consists of the following 
sequence of operations: 

• Find the element of maximum magnitude in the $r$th column below the 
diagonal. If it is zero, skip the next two “bullets” and the stage is done. 
Otherwise, suppose the maximum element was in row $r'$. 

• Interchange rows $r'$ and $r + 1$. This is the pivoting procedure. To make 
the permutation a similarity transformation, also interchange columns $r'$ 
and $r + 1$. 

• For $i = r + 2, r + 3, \ldots, N$, compute the multiplier 

$$n_{i,r+1} \equiv \frac{a_{ir}}{a_{r+1,r}}$$

Subtract $n_{i,r+1}$ times row $r + 1$ from row $i$. To make the elimination a 
similarity transformation, also add $n_{i,r+1}$ times column $i$ to column $r + 1$. 

A total of $N - 2$ such stages are required. 

When the magnitudes of the matrix elements vary over many orders, you should 
try to rearrange the matrix so that the largest elements are in the top left-hand corner. 
This reduces the roundoff error, since the reduction proceeds from left to right. 

Since we are concerned only with eigenvalues, the routine elmhes does not 
keep track of the accumulated similarity transformation. The operation count is 
about $5N^3/6$ for large $N$. 

#include <math.h> 

#define SWAP(g,h) {y=(g);(g)=(h);(h)=y;} 

void elmhes(float **a, int n) 
/* Reduction to Hessenberg form by the elimination method. The real, nonsymmetric matrix 
a[1..n][1..n] is replaced by an upper Hessenberg matrix with identical eigenvalues. 
Recommended, but not required, is that this routine be preceded by balanc. On output, the 
Hessenberg matrix is in elements a[i][j] with i <= j+1. Elements with i > j+1 are to be 
thought of as zero, but are returned with random values. */ 
{ 
	int m,j,i; 
	float y,x; 

	for (m=2;m<n;m++) {                   /* m is called r+1 in the text. */ 
		x=0.0; 
		i=m;                              /* Default pivot row if the column is already zero. */ 
		for (j=m;j<=n;j++) {              /* Find the pivot. */ 
			if (fabs(a[j][m-1]) > fabs(x)) { 
				x=a[j][m-1]; 
				i=j; 
			} 
		} 
		if (i != m) {                     /* Interchange rows and columns. */ 
			for (j=m-1;j<=n;j++) SWAP(a[i][j],a[m][j]) 
			for (j=1;j<=n;j++) SWAP(a[j][i],a[j][m]) 
		} 
		if (x) {                          /* Carry out the elimination. */ 
			for (i=m+1;i<=n;i++) { 
				if ((y=a[i][m-1]) != 0.0) { 
					y /= x; 
					a[i][m-1]=y; 
					for (j=m;j<=n;j++) 
						a[i][j] -= y*a[m][j]; 
					for (j=1;j<=n;j++) 
						a[j][m] += y*a[j][i]; 
				} 
			} 
		} 
	} 
} 


CITED REFERENCES AND FURTHER READING: 

Smith, B.T., et al. 1976, Matrix Eigensystem Routines — EISPACK Guide, 2nd ed., vol. 6 of 
Lecture Notes in Computer Science (New York: Springer-Verlag). [2] 


11.6 The QR Algorithm for Real Hessenberg 
Matrices 


Recall the following relations for the QR algorithm with shifts: 


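$$\mathbf{Q}_s \cdot (\mathbf{A}_s - k_s \mathbf{1}) = \mathbf{R}_s \tag{11.6.1}$$

where $\mathbf{Q}$ is orthogonal and $\mathbf{R}$ is upper triangular, and 

$$\mathbf{A}_{s+1} = \mathbf{R}_s \cdot \mathbf{Q}_s^T + k_s \mathbf{1} = \mathbf{Q}_s \cdot \mathbf{A}_s \cdot \mathbf{Q}_s^T \tag{11.6.2}$$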


The QR transformation preserves the upper Hessenberg form of the original matrix 
$\mathbf{A} \equiv \mathbf{A}_1$, and the workload on such a matrix is $O(n^2)$ per iteration as opposed 
to $O(n^3)$ on a general matrix. As $s \to \infty$, $\mathbf{A}_s$ converges to a form where the 
eigenvalues are either isolated on the diagonal or are eigenvalues of a 2 × 2 submatrix 
on the diagonal. 

As we pointed out in §11.3, shifting is essential for rapid convergence. A key 
difference here is that a nonsymmetric real matrix can have complex eigenvalues. This 
means that good choices for the shifts $k_s$ may be complex, apparently necessitating 
complex arithmetic. 

Complex arithmetic can be avoided, however, by a clever trick. The trick 
depends on a result analogous to the lemma we used for implicit shifts in § 11.3. The 
lemma we need here states that if B is a nonsingular matrix such that 


$$\mathbf{B} \cdot \mathbf{Q} = \mathbf{Q} \cdot \mathbf{H} \tag{11.6.3}$$

where Q is orthogonal and H is upper Hessenberg, then Q and H are fully determined 
by the first column of Q. (The determination is unique if H has positive subdiagonal 
elements.) The lemma can be proved by induction analogously to the proof given 
for tridiagonal matrices in §11.3. 

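The lemma is used in practice by taking two steps of the QR algorithm, 
either with two real shifts $k_s$ and $k_{s+1}$, or with complex conjugate values $k_s$ and 
$k_{s+1} = k_s^*$. This gives a real matrix $\mathbf{A}_{s+2}$, where 

$$\mathbf{A}_{s+2} = \mathbf{Q}_{s+1} \cdot \mathbf{Q}_s \cdot \mathbf{A}_s \cdot \mathbf{Q}_s^T \cdot \mathbf{Q}_{s+1}^T \tag{11.6.4}$$

The $\mathbf{Q}$'s are determined by 

$$\mathbf{A}_s - k_s \mathbf{1} = \mathbf{Q}_s^T \cdot \mathbf{R}_s \tag{11.6.5}$$

$$\mathbf{A}_{s+1} = \mathbf{Q}_s \cdot \mathbf{A}_s \cdot \mathbf{Q}_s^T \tag{11.6.6}$$

$$\mathbf{A}_{s+1} - k_{s+1} \mathbf{1} = \mathbf{Q}_{s+1}^T \cdot \mathbf{R}_{s+1} \tag{11.6.7}$$

Using (11.6.6), equation (11.6.7) can be rewritten 

$$\mathbf{A}_s - k_{s+1} \mathbf{1} = \mathbf{Q}_s^T \cdot \mathbf{Q}_{s+1}^T \cdot \mathbf{R}_{s+1} \cdot \mathbf{Q}_s \tag{11.6.8}$$

Hence, if we define 

$$\mathbf{M} = (\mathbf{A}_s - k_{s+1} \mathbf{1}) \cdot (\mathbf{A}_s - k_s \mathbf{1}) \tag{11.6.9}$$

equations (11.6.5) and (11.6.8) give 

$$\mathbf{R} = \mathbf{Q} \cdot \mathbf{M} \tag{11.6.10}$$

where 

$$\mathbf{Q} = \mathbf{Q}_{s+1} \cdot \mathbf{Q}_s \tag{11.6.11}$$

$$\mathbf{R} = \mathbf{R}_{s+1} \cdot \mathbf{R}_s \tag{11.6.12}$$

Equation (11.6.4) can be rewritten 

$$\mathbf{A}_s \cdot \mathbf{Q}^T = \mathbf{Q}^T \cdot \mathbf{A}_{s+2} \tag{11.6.13}$$

Thus suppose we can somehow find an upper Hessenberg matrix $\mathbf{H}$ such that 

$$\mathbf{A}_s \cdot \overline{\mathbf{Q}}^T = \overline{\mathbf{Q}}^T \cdot \mathbf{H} \tag{11.6.14}$$

where $\overline{\mathbf{Q}}$ is orthogonal. If $\overline{\mathbf{Q}}^T$ has the same first column as $\mathbf{Q}^T$ (i.e., $\overline{\mathbf{Q}}$ has the same 
first row as $\mathbf{Q}$), then $\overline{\mathbf{Q}} = \mathbf{Q}$ and $\mathbf{A}_{s+2} = \mathbf{H}$. 

The first row of $\mathbf{Q}$ is found as follows. Equation (11.6.10) shows that $\mathbf{Q}$ is 
the orthogonal matrix that triangularizes the real matrix $\mathbf{M}$. Any real matrix can 
be triangularized by premultiplying it by a sequence of Householder matrices $\mathbf{P}_1$ 
(acting on the first column), $\mathbf{P}_2$ (acting on the second column), ..., $\mathbf{P}_{n-1}$. Thus 
$\mathbf{Q} = \mathbf{P}_{n-1} \cdots \mathbf{P}_2 \cdot \mathbf{P}_1$, and the first row of $\mathbf{Q}$ is the first row of $\mathbf{P}_1$ since $\mathbf{P}_i$ is an 
$(i - 1) \times (i - 1)$ identity matrix in the top left-hand corner. We now must find $\overline{\mathbf{Q}}$ 
satisfying (11.6.14) whose first row is that of $\mathbf{P}_1$. 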

The Householder matrix $\mathbf{P}_1$ is determined by the first column of $\mathbf{M}$. Since $\mathbf{A}_s$ 
is upper Hessenberg, equation (11.6.9) shows that the first column of $\mathbf{M}$ has the 
form $[p_1, q_1, r_1, 0, \ldots, 0]^T$, where 

$$
\begin{aligned}
p_1 &= a_{11}^2 - a_{11}(k_s + k_{s+1}) + k_s k_{s+1} + a_{12} a_{21} \\
q_1 &= a_{21}(a_{11} + a_{22} - k_s - k_{s+1}) \\
r_1 &= a_{21} a_{32}
\end{aligned}
\tag{11.6.15}
$$
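
The formulas for the first column of $\mathbf{M}$ involve the shifts only through their sum and product, which are real even when the shifts form a complex-conjugate pair. A minimal sketch of that evaluation follows; the function name `first_column_of_m` and its argument list are assumptions for illustration.

```c
/* Hypothetical helper: first column (p, q, r, 0, ..., 0) of
   M = (A - k2*1)(A - k1*1) for upper Hessenberg A, using only
   ksum = k1 + k2 and kprod = k1 * k2. */
void first_column_of_m(double a11, double a12, double a21, double a22,
                       double a32, double ksum, double kprod,
                       double *p, double *q, double *r)
{
    *p = a11 * a11 - a11 * ksum + kprod + a12 * a21;
    *q = a21 * (a11 + a22 - ksum);
    *r = a21 * a32;
}
```

For the 3 × 3 Hessenberg matrix with rows (1, 2, 3), (4, 5, 6), (0, 7, 8) and real shifts 1 and 2, multiplying out $(\mathbf{A} - 2 \cdot \mathbf{1})(\mathbf{A} - \mathbf{1})$ by hand gives the first column (8, 12, 28), which is what the helper returns for `ksum = 3`, `kprod = 2`.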


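Hence 

$$\mathbf{P}_1 = \mathbf{1} - 2 \mathbf{w}_1 \cdot \mathbf{w}_1^T \tag{11.6.16}$$

where $\mathbf{w}_1$ has only its first 3 elements nonzero (cf. equation 11.2.5). The matrix 
$\mathbf{P}_1 \cdot \mathbf{A}_s \cdot \mathbf{P}_1^T$ is therefore upper Hessenberg with 3 extra elements (shown as $\mathbf{x}$): 

$$
\mathbf{P}_1 \cdot \mathbf{A}_s \cdot \mathbf{P}_1^T =
\begin{bmatrix}
\times & \times & \times & \times & \times & \times & \times \\
\times & \times & \times & \times & \times & \times & \times \\
\mathbf{x} & \times & \times & \times & \times & \times & \times \\
\mathbf{x} & \mathbf{x} & \times & \times & \times & \times & \times \\
 & & & \times & \times & \times & \times \\
 & & & & \times & \times & \times \\
 & & & & & \times & \times
\end{bmatrix}
\tag{11.6.17}
$$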


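This matrix can be restored to upper Hessenberg form without affecting the first row 
by a sequence of Householder similarity transformations. The first such Householder 
matrix, $\mathbf{P}_2$, acts on elements 2, 3, and 4 in the first column, annihilating elements 
3 and 4. This produces a matrix of the same form as (11.6.17), with the 3 extra 
elements appearing one column over: 

$$
\begin{bmatrix}
\times & \times & \times & \times & \times & \times & \times \\
\times & \times & \times & \times & \times & \times & \times \\
 & \times & \times & \times & \times & \times & \times \\
 & \mathbf{x} & \times & \times & \times & \times & \times \\
 & \mathbf{x} & \mathbf{x} & \times & \times & \times & \times \\
 & & & & \times & \times & \times \\
 & & & & & \times & \times
\end{bmatrix}
\tag{11.6.18}
$$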

Proceeding in this way up to $\mathbf{P}_{n-1}$, we see that at each stage the Householder 
matrix $\mathbf{P}_r$ has a vector $\mathbf{w}_r$ that is nonzero only in elements $r$, $r + 1$, and $r + 2$. 
These elements are determined by the elements $r$, $r + 1$, and $r + 2$ in the $(r - 1)$st 
column of the current matrix. Note that the preliminary matrix $\mathbf{P}_1$ has the same 
structure as $\mathbf{P}_2, \ldots, \mathbf{P}_{n-1}$. 

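The result is that 

$$\mathbf{P}_{n-1} \cdots \mathbf{P}_2 \cdot \mathbf{P}_1 \cdot \mathbf{A}_s \cdot \mathbf{P}_1^T \cdot \mathbf{P}_2^T \cdots \mathbf{P}_{n-1}^T = \mathbf{H} \tag{11.6.19}$$

where $\mathbf{H}$ is upper Hessenberg. Thus 

$$\overline{\mathbf{Q}} = \mathbf{Q} = \mathbf{P}_{n-1} \cdots \mathbf{P}_2 \cdot \mathbf{P}_1 \tag{11.6.20}$$

and 

$$\mathbf{A}_{s+2} = \mathbf{H} \tag{11.6.21}$$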

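The shifts of origin at each stage are taken to be the eigenvalues of the 2 × 2 
matrix in the bottom right-hand corner of the current $\mathbf{A}_s$. Their sum is the trace of 
that 2 × 2 matrix and their product is its determinant. This gives 

$$
\begin{aligned}
k_s + k_{s+1} &= a_{n-1,n-1} + a_{nn} \\
k_s k_{s+1} &= a_{n-1,n-1} a_{nn} - a_{n-1,n} a_{n,n-1}
\end{aligned}
\tag{11.6.22}
$$

Substituting (11.6.22) in (11.6.15), we get 

$$
\begin{aligned}
p_1 &= a_{21} \{ [ (a_{nn} - a_{11})(a_{n-1,n-1} - a_{11}) - a_{n-1,n} a_{n,n-1} ] / a_{21} + a_{12} \} \\
q_1 &= a_{21} [ a_{22} - a_{11} - (a_{nn} - a_{11}) - (a_{n-1,n-1} - a_{11}) ] \\
r_1 &= a_{21} a_{32}
\end{aligned}
\tag{11.6.23}
$$

We have judiciously grouped terms to reduce possible roundoff when there are 
small off-diagonal elements. Since only the ratios of elements are relevant for a 
Householder transformation, we can omit the factor $a_{21}$ from (11.6.23). 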
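
The grouped form, with the common factor $a_{21}$ omitted, can be sketched as follows. The name `double_shift_pqr` and the 0-based row-major `double` storage are assumptions for illustration; the sketch also assumes $n \ge 3$ and a nonzero subdiagonal element $a_{21}$.

```c
/* Hypothetical helper: p, q, r of (11.6.23) without the overall factor a21,
   with the shifts taken as the eigenvalues of the trailing 2 x 2 block.
   a is n x n, row-major, 0-based, upper Hessenberg; requires n >= 3 and
   a[1][0] != 0. */
void double_shift_pqr(int n, const double *a, double *p, double *q, double *r)
{
    double a11 = a[0], a12 = a[1], a21 = a[n], a22 = a[n + 1];
    double a32 = a[2 * n + 1];
    double x = a[(n - 1) * n + (n - 1)];                 /* a_nn         */
    double y = a[(n - 2) * n + (n - 2)];                 /* a_{n-1,n-1}  */
    double w = a[(n - 1) * n + (n - 2)] * a[(n - 2) * n + (n - 1)];

    *p = ((x - a11) * (y - a11) - w) / a21 + a12;
    *q = a22 - a11 - (x - a11) - (y - a11);
    *r = a32;
}
```

For the 3 × 3 matrix with rows (1, 2, 3), (4, 5, 6), (0, 7, 8), the trailing block has trace 13 and determinant −2; inserting these in (11.6.15) and dividing by $a_{21} = 4$ gives (−1.5, −7, 7), in agreement with the grouped form.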


In summary, to carry out a double QR step we construct the Householder 
matrices $\mathbf{P}_r$, $r = 1, \ldots, n - 1$. For $\mathbf{P}_1$ we use $p_1$, $q_1$, and $r_1$ given by (11.6.23). For 
the remaining matrices, $p_r$, $q_r$, and $r_r$ are determined by the $(r, r - 1)$, $(r + 1, r - 1)$, 
and $(r + 2, r - 1)$ elements of the current matrix. The number of arithmetic 
operations can be reduced by writing the nonzero elements of the $2\mathbf{w} \cdot \mathbf{w}^T$ part of 
the Householder matrix in the form 

$$
2\mathbf{w} \cdot \mathbf{w}^T =
\begin{bmatrix} (p \pm s)/(\pm s) \\ q/(\pm s) \\ r/(\pm s) \end{bmatrix}
\cdot
\begin{bmatrix} 1 & q/(p \pm s) & r/(p \pm s) \end{bmatrix}
\tag{11.6.24}
$$

where 

$$s^2 = p^2 + q^2 + r^2 \tag{11.6.25}$$

(We have simply divided each element by a piece of the normalizing factor; cf. 
the equations in §11.2.) 

If we proceed in this way, convergence is usually very fast. There are two 
possible ways of terminating the iteration for an eigenvalue. First, if $a_{n,n-1}$ becomes 
“negligible,” then $a_{nn}$ is an eigenvalue. We can then delete the $n$th row and column 
of the matrix and look for the next eigenvalue. Alternatively, $a_{n-1,n-2}$ may become 
negligible. In this case the eigenvalues of the 2 × 2 matrix in the lower right-hand 
corner may be taken to be eigenvalues. We delete the $n$th and $(n - 1)$st rows and 
columns of the matrix and continue. 

The test for convergence to an eigenvalue is combined with a test for negligible 
subdiagonal elements that allows splitting of the matrix into submatrices. We find the 
largest $i$ such that $a_{i,i-1}$ is negligible. If $i = n$, we have found a single eigenvalue. If 
$i = n - 1$, we have found two eigenvalues. Otherwise we continue the iteration on the 
submatrix in rows $i$ to $n$ ($i$ being set to unity if there is no small subdiagonal element). 
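
The search for a negligible subdiagonal element can be sketched as below. The name `find_split`, the 0-based row-major `double` storage, and the return convention are assumptions for illustration; "negligible" is taken to mean that adding the element to the sum of the neighboring diagonal magnitudes does not change that sum in floating point.

```c
#include <math.h>

/* Hypothetical helper: returns the largest 0-based row l (1 <= l < n) whose
   subdiagonal element a[l][l-1] is negligible relative to its diagonal
   neighbors, or 0 if there is none. anorm is a matrix norm used as the
   reference scale when both neighboring diagonal elements vanish. */
int find_split(int n, const double *a, double anorm)
{
    for (int l = n - 1; l >= 1; l--) {
        double s = fabs(a[(l - 1) * n + (l - 1)]) + fabs(a[l * n + l]);
        if (s == 0.0) s = anorm;
        double sum = fabs(a[l * n + (l - 1)]) + s;
        if (sum == s) return l;
    }
    return 0;
}
```

In this 0-based convention a return value of `n - 1` corresponds to a single isolated eigenvalue and `n - 2` to a 2 × 2 block; any other value marks the first row of the active submatrix.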

After determining $i$, the submatrix in rows $i$ to $n$ is examined to see if the product 
of any two consecutive subdiagonal elements is small enough that we can work with 
an even smaller submatrix, starting say in row $m$. We start with $m = n - 2$ and 
decrement it down to $i + 1$, computing $p$, $q$, and $r$ according to equations (11.6.23) 
with 1 replaced by $m$ and 2 by $m + 1$. If these were indeed the elements of the special 
“first” Householder matrix in a double QR step, then applying the Householder 
matrix would lead to nonzero elements in positions $(m + 1, m - 1)$, $(m + 2, m - 1)$, 
and $(m + 2, m)$. We require that the first two of these elements be small compared 
with the local diagonal elements $a_{m-1,m-1}$, $a_{mm}$ and $a_{m+1,m+1}$. A satisfactory 
approximate criterion is 

$$|a_{m,m-1}| (|q| + |r|) \ll |p| (|a_{m+1,m+1}| + |a_{mm}| + |a_{m-1,m-1}|) \tag{11.6.26}$$

Very rarely, the procedure described so far will fail to converge. On such 
matrices, experience shows that if one double step is performed with any shifts 
that are of order the norm of the matrix, convergence is subsequently very rapid. 
Accordingly, if ten iterations occur without determining an eigenvalue, the usual 
shifts are replaced for the next iteration by shifts defined by 

$$
\begin{aligned}
k_s + k_{s+1} &= 1.5 \times (|a_{n,n-1}| + |a_{n-1,n-2}|) \\
k_s k_{s+1} &= (|a_{n,n-1}| + |a_{n-1,n-2}|)^2
\end{aligned}
\tag{11.6.27}
$$

The factor 1.5 was arbitrarily chosen to lessen the likelihood of an “unfortunate” 
choice of shifts. This strategy is repeated after 20 unsuccessful iterations. After 30 
unsuccessful iterations, the routine reports failure. 

The operation count for the QR algorithm described here is $\sim 5k^2$ per iteration, 
where $k$ is the current size of the matrix. The typical average number of iterations per 
eigenvalue is $\sim 1.8$, so the total operation count for all the eigenvalues is $\sim 3n^3$. This 
estimate neglects any possible efficiency due to splitting or sparseness of the matrix. 

The following routine hqr is based algorithmically on the above description, 
in turn following the implementations in [1,2]. 

#include <math.h> 
#include "nrutil.h" 

void hqr(float **a, int n, float wr[], float wi[]) 
/* Finds all eigenvalues of an upper Hessenberg matrix a[1..n][1..n]. On input a can be 
exactly as output from elmhes §11.5; on output it is destroyed. The real and imaginary parts 
of the eigenvalues are returned in wr[1..n] and wi[1..n], respectively. */ 
{ 
	int nn,m,l,k,j,its,i,mmin; 
	float z,y,x,w,v,u,t,s,r,q,p,anorm; 

	anorm=0.0;                        /* Compute matrix norm for possible use in locating */ 
	for (i=1;i<=n;i++)                /* single small subdiagonal element. */ 
		for (j=IMAX(i-1,1);j<=n;j++) 
			anorm += fabs(a[i][j]); 
	nn=n; 
	t=0.0;                            /* Gets changed only by an exceptional shift. */ 
	while (nn >= 1) {                 /* Begin search for next eigenvalue. */ 
		its=0; 
		do { 
			for (l=nn;l>=2;l--) {     /* Begin iteration: look for single small subdiagonal element. */ 
				s=fabs(a[l-1][l-1])+fabs(a[l][l]); 
				if (s == 0.0) s=anorm; 
				if ((float)(fabs(a[l][l-1]) + s) == s) { 
					a[l][l-1]=0.0; 
					break; 
				} 
			} 
			x=a[nn][nn]; 
			if (l == nn) {            /* One root found. */ 
				wr[nn]=x+t; 
				wi[nn--]=0.0; 
			} else { 
				y=a[nn-1][nn-1]; 
				w=a[nn][nn-1]*a[nn-1][nn]; 
				if (l == (nn-1)) {    /* Two roots found... */ 
					p=0.5*(y-x); 
					q=p*p+w; 
					z=sqrt(fabs(q)); 
					x += t; 
					if (q >= 0.0) {   /* ...a real pair. */ 
						z=p+SIGN(z,p); 
						wr[nn-1]=wr[nn]=x+z; 
						if (z) wr[nn]=x-w/z; 
						wi[nn-1]=wi[nn]=0.0; 
					} else {          /* ...a complex pair. */ 
						wr[nn-1]=wr[nn]=x+p; 
						wi[nn-1]= -(wi[nn]=z); 
					} 
					nn -= 2; 
				} else {              /* No roots found. Continue iteration. */ 
					if (its == 30) nrerror("Too many iterations in hqr"); 
					if (its == 10 || its == 20) {   /* Form exceptional shift. */ 
						t += x; 
						for (i=1;i<=nn;i++) a[i][i] -= x; 
						s=fabs(a[nn][nn-1])+fabs(a[nn-1][nn-2]); 
						y=x=0.75*s; 
						w = -0.4375*s*s; 
					} 
					++its; 
					for (m=(nn-2);m>=l;m--) {   /* Form shift and then look for 2 consecutive */ 
						z=a[m][m];              /* small subdiagonal elements. */ 
						r=x-z; 
						s=y-z; 
						p=(r*s-w)/a[m+1][m]+a[m][m+1];   /* Equation (11.6.23). */ 
						q=a[m+1][m+1]-z-r-s; 
						r=a[m+2][m+1]; 
						s=fabs(p)+fabs(q)+fabs(r);       /* Scale to prevent overflow or underflow. */ 
						p /= s; 
						q /= s; 
						r /= s; 
						if (m == l) break; 
						u=fabs(a[m][m-1])*(fabs(q)+fabs(r)); 
						v=fabs(p)*(fabs(a[m-1][m-1])+fabs(z)+fabs(a[m+1][m+1])); 
						if ((float)(u+v) == v) break;    /* Equation (11.6.26). */ 
					} 
					for (i=m+2;i<=nn;i++) { 
						a[i][i-2]=0.0; 
						if (i != (m+2)) a[i][i-3]=0.0; 
					} 
					for (k=m;k<=nn-1;k++) { 
						/* Double QR step on rows l to nn and columns m to nn. */ 
						if (k != m) { 
							p=a[k][k-1];                 /* Begin setup of Householder vector. */ 
							q=a[k+1][k-1]; 
							r=0.0; 
							if (k != (nn-1)) r=a[k+2][k-1]; 
							if ((x=fabs(p)+fabs(q)+fabs(r)) != 0.0) { 
								p /= x;                  /* Scale to prevent overflow or underflow. */ 
								q /= x; 
								r /= x; 
							} 
						} 
						if ((s=SIGN(sqrt(p*p+q*q+r*r),p)) != 0.0) { 
							if (k == m) { 
								if (l != m) 
									a[k][k-1] = -a[k][k-1]; 
							} else 
								a[k][k-1] = -s*x; 
							p += s;                      /* Equations (11.6.24). */ 
							x=p/s; 
							y=q/s; 
							z=r/s; 
							q /= p; 
							r /= p; 
							for (j=k;j<=nn;j++) {        /* Row modification. */ 
								p=a[k][j]+q*a[k+1][j]; 
								if (k != (nn-1)) { 
									p += r*a[k+2][j]; 
									a[k+2][j] -= p*z; 
								} 
								a[k+1][j] -= p*y; 
								a[k][j] -= p*x; 
							} 
							mmin = nn<k+3 ? nn : k+3; 
							for (i=l;i<=mmin;i++) {      /* Column modification. */ 
								p=x*a[i][k]+y*a[i][k+1]; 
								if (k != (nn-1)) { 
									p += z*a[i][k+2]; 
									a[i][k+2] -= p*r; 
								} 
								a[i][k+1] -= p*q; 
								a[i][k] -= p; 
							} 
						} 
					} 
				} 
			} 
		} while (l < nn-1); 
	} 
} 


11.7 Improving Eigenvalues and/or Finding 
Eigenvectors by Inverse Iteration 

The basic idea behind inverse iteration is quite simple. Let $\mathbf{y}$ be the solution 
of the linear system 

$$(\mathbf{A} - \tau \mathbf{1}) \cdot \mathbf{y} = \mathbf{b} \tag{11.7.1}$$

where $\mathbf{b}$ is a random vector and $\tau$ is close to some eigenvalue $\lambda$ of $\mathbf{A}$. Then the 
solution $\mathbf{y}$ will be close to the eigenvector corresponding to $\lambda$. The procedure can 
be iterated: Replace $\mathbf{b}$ by $\mathbf{y}$ and solve for a new $\mathbf{y}$, which will be even closer to 
the true eigenvector. 

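We can see why this works by expanding both $\mathbf{y}$ and $\mathbf{b}$ as linear combinations 
of the eigenvectors $\mathbf{x}_j$ of $\mathbf{A}$: 

$$\mathbf{y} = \sum_j \alpha_j \mathbf{x}_j \qquad \mathbf{b} = \sum_j \beta_j \mathbf{x}_j \tag{11.7.2}$$

Then (11.7.1) gives 

$$\sum_j \alpha_j (\lambda_j - \tau) \mathbf{x}_j = \sum_j \beta_j \mathbf{x}_j \tag{11.7.3}$$

so that 

$$\alpha_j = \frac{\beta_j}{\lambda_j - \tau} \tag{11.7.4}$$

and 

$$\mathbf{y} = \sum_j \frac{\beta_j \mathbf{x}_j}{\lambda_j - \tau} \tag{11.7.5}$$

If $\tau$ is close to $\lambda_n$, say, then provided $\beta_n$ is not accidentally too small, $\mathbf{y}$ will be 
approximately $\mathbf{x}_n$, up to a normalization. Moreover, each iteration of this procedure 
gives another power of $\lambda_j - \tau$ in the denominator of (11.7.5). Thus the convergence 
is rapid for well-separated eigenvalues. 

Suppose at the $k$th stage of iteration we are solving the equation 

$$(\mathbf{A} - \tau_k \mathbf{1}) \cdot \mathbf{y} = \mathbf{b}_k \tag{11.7.6}$$

where $\mathbf{b}_k$ and $\tau_k$ are our current guesses for some eigenvector and eigenvalue of 
interest (let's say, $\mathbf{x}_n$ and $\lambda_n$). Normalize $\mathbf{b}_k$ so that $\mathbf{b}_k \cdot \mathbf{b}_k = 1$. The exact 
eigenvector and eigenvalue satisfy 

$$\mathbf{A} \cdot \mathbf{x}_n = \lambda_n \mathbf{x}_n \tag{11.7.7}$$

so 

$$(\mathbf{A} - \tau_k \mathbf{1}) \cdot \mathbf{x}_n = (\lambda_n - \tau_k) \mathbf{x}_n \tag{11.7.8}$$

Since $\mathbf{y}$ of (11.7.6) is an improved approximation to $\mathbf{x}_n$, we normalize it and set 

$$\mathbf{b}_{k+1} = \frac{\mathbf{y}}{|\mathbf{y}|} \tag{11.7.9}$$

We get an improved estimate of the eigenvalue by substituting our improved guess 
$\mathbf{y}$ for $\mathbf{x}_n$ in (11.7.8). By (11.7.6), the left-hand side is $\mathbf{b}_k$, so calling $\lambda_n$ our new 
value $\tau_{k+1}$ and taking the dot product of both sides with $\mathbf{b}_k$, we find 

$$\tau_{k+1} = \tau_k + \frac{1}{\mathbf{b}_k \cdot \mathbf{y}} \tag{11.7.10}$$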


While the above formulas look simple enough, in practice the implementation 
can be quite tricky. The first question to be resolved is when to use inverse iteration. 
Most of the computational load occurs in solving the linear system (11.7.6). Thus 
a possible strategy is first to reduce the matrix A to a special form that allows easy 
solution of (11.7.6). Tridiagonal form for symmetric matrices or Hessenberg for 
nonsymmetric are the obvious choices. Then apply inverse iteration to generate 
all the eigenvectors. While this is an $O(N^3)$ method for symmetric matrices, it 
is many times less efficient than the QL method given earlier. In fact, even the 
best inverse iteration packages are less efficient than the QL method as soon as 
more than about 25 percent of the eigenvectors are required. Accordingly, inverse 
iteration is generally used when one already has good eigenvalues and wants only 
a few selected eigenvectors. 

You can write a simple inverse iteration routine yourself using LU decomposition 
to solve (11.7.6). You can decide whether to use the general LU algorithm 
we gave in Chapter 2 or whether to take advantage of tridiagonal or Hessenberg 
form. Note that, since the linear system (11.7.6) is nearly singular, you must be 
careful to use a version of LU decomposition like that in §2.3 which replaces a zero 
pivot with a very small number. 

We have chosen not to give a general inverse iteration routine in this book, 
because it is quite cumbersome to take account of all the cases that can arise. 
Routines are given, for example, in [1,2]. If you use these, or write your own routine, 
you may appreciate the following pointers. 

One starts by supplying an initial value $\tau_0$ for the eigenvalue $\lambda_n$ of interest. 
Choose a random normalized vector $\mathbf{b}_0$ as the initial guess for the eigenvector $\mathbf{x}_n$, 
and solve (11.7.6). The new vector $\mathbf{y}$ is bigger than $\mathbf{b}_0$ by a “growth factor” $|\mathbf{y}|$, 
which ideally should be large. Equivalently, the change in the eigenvalue, which by 
(11.7.10) is essentially $1/|\mathbf{y}|$, should be small. The following cases can arise: 

• If the growth factor is too small initially, then we assume we have made 
a “bad” choice of random vector. This can happen not just because of 
a small $\beta_n$ in (11.7.5), but also in the case of a defective matrix, when 
(11.7.5) does not even apply (see, e.g., [1] or [3] for details). We go back 
to the beginning and choose a new initial vector. 

• The change $|\mathbf{b}_1 - \mathbf{b}_0|$ might be less than some tolerance $\epsilon$. We can use this 
as a criterion for stopping, iterating until it is satisfied, with a maximum 
of 5 to 10 iterations, say. 

• After a few iterations, if $|\mathbf{b}_{k+1} - \mathbf{b}_k|$ is not decreasing rapidly enough, 
we can try updating the eigenvalue according to (11.7.10). If $\tau_{k+1} = \tau_k$ 
to machine accuracy, we are not going to improve the eigenvector much 
more and can quit. Otherwise start another cycle of iterations with the 
new eigenvalue. 

The reason we do not update the eigenvalue at every step is that when we solve 
the linear system (11.7.6) by LU decomposition, we can save the decomposition 
if $\tau_k$ is fixed. We only need do the backsubstitution step each time we update $\mathbf{b}_k$. 
The number of iterations we decide to do with a fixed $\tau_k$ is a trade-off between the 
quadratic convergence but $O(N^3)$ workload for updating $\tau_k$ at each step and the 
linear convergence but $O(N^2)$ load for keeping $\tau_k$ fixed. If you have determined the 
eigenvalue by one of the routines given earlier in the chapter, it is probably correct 
to machine accuracy anyway, and you can omit updating it. 

There are two different pathologies that can arise during inverse iteration. The 
first is multiple or closely spaced roots. This is more often a problem with symmetric 
matrices. Inverse iteration will find only one eigenvector for a given initial guess $\tau_0$. 
A good strategy is to perturb the last few significant digits in $\tau_0$ and then repeat the 
iteration. Usually this provides an independent eigenvector. Special steps generally 
have to be taken to ensure orthogonality of the linearly independent eigenvectors, 
whereas the Jacobi and QL algorithms automatically yield orthogonal eigenvectors 
even in the case of multiple eigenvalues. 

The second problem, peculiar to nonsymmetric matrices, is the defective case. 
Unless one makes a “good” initial guess, the growth factor is small. Moreover, 
iteration does not improve matters. In this case, the remedy is to choose random 
initial vectors, solve (11.7.6) once, and quit as soon as any vector gives an acceptably 
large growth factor. Typically only a few trials are necessary. 

One further complication in the nonsymmetric case is that a real matrix can 
have complex-conjugate pairs of eigenvalues. You will then have to use complex 
arithmetic to solve (11.7.6) for the complex eigenvectors. For any moderate-sized 
(or larger) nonsymmetric matrix, our recommendation is to avoid inverse iteration 
in favor of a QR method that includes the eigenvector computation in complex 
arithmetic. You will find routines for this in [1,2] and other places. 


12.0 Introduction 

A very large class of important computational problems falls under the general 
rubric of “Fourier transform methods” or “spectral methods.” For some of these 
problems, the Fourier transform is simply an efficient computational tool for 
accomplishing certain common manipulations of data. In other cases, we have 
problems for which the Fourier transform (or the related “power spectrum”) is itself 
of intrinsic interest. These two kinds of problems share a common methodology. 

Largely for historical reasons the literature on Fourier and spectral methods has 
been disjoint from the literature on “classical” numerical analysis. Nowadays there 
is no justification for such a split. Fourier methods are commonplace in research and 
we shall not treat them as specialized or arcane. At the same time, we realize that 
many computer users have had relatively less experience with this field than with, say, 
differential equations or numerical integration. Therefore our summary of analytical 
results will be more complete. Numerical algorithms, per se, begin in § 12.2. Various 
applications of Fourier transform methods are discussed in Chapter 13. 

A physical process can be described either in the time domain, by the values of 
some quantity $h$ as a function of time $t$, e.g., $h(t)$, or else in the frequency domain, 
where the process is specified by giving its amplitude $H$ (generally a complex 
number indicating phase also) as a function of frequency $f$, that is $H(f)$, with 
$-\infty < f < \infty$. For many purposes it is useful to think of $h(t)$ and $H(f)$ as being 
two different representations of the same function. One goes back and forth between 
these two representations by means of the Fourier transform equations, 

$$
\begin{aligned}
H(f) &= \int_{-\infty}^{\infty} h(t) e^{2\pi i f t} \, dt \\
h(t) &= \int_{-\infty}^{\infty} H(f) e^{-2\pi i f t} \, df
\end{aligned}
\tag{12.0.1}
$$


If $t$ is measured in seconds, then $f$ in equation (12.0.1) is in cycles per second, 
or Hertz (the unit of frequency). However, the equations work with other units too. If 
$h$ is a function of position $x$ (in meters), $H$ will be a function of inverse wavelength 
(cycles per meter), and so on. If you are trained as a physicist or mathematician, you 
are probably more used to using angular frequency $\omega$, which is given in radians per 
sec. The relation between $\omega$ and $f$, $H(\omega)$ and $H(f)$ is 

$$\omega \equiv 2\pi f \qquad H(\omega) \equiv [H(f)]_{f = \omega / 2\pi} \tag{12.0.2}$$

and equation (12.0.1) looks like this 

$$
\begin{aligned}
H(\omega) &= \int_{-\infty}^{\infty} h(t) e^{i \omega t} \, dt \\
h(t) &= \frac{1}{2\pi} \int_{-\infty}^{\infty} H(\omega) e^{-i \omega t} \, d\omega
\end{aligned}
\tag{12.0.3}
$$

We were raised on the $\omega$-convention, but we changed! There are fewer factors of 
$2\pi$ to remember if you use the $f$-convention, especially when we get to discretely 
sampled data in §12.1. 

From equation (12.0.1) it is evident at once that Fourier transformation is a 
linear operation. The transform of the sum of two functions is equal to the sum of 
the transforms. The transform of a constant times a function is that same constant 
times the transform of the function. 

In the time domain, function $h(t)$ may happen to have one or more special 
symmetries. It might be purely real or purely imaginary or it might be even, 
$h(t) = h(-t)$, or odd, $h(t) = -h(-t)$. In the frequency domain, these symmetries 
lead to relationships between $H(f)$ and $H(-f)$. The following table gives the 
correspondence between symmetries in the two domains: 

If ...                          then ... 
h(t) is real                    H(-f) = [H(f)]* 
h(t) is imaginary               H(-f) = -[H(f)]* 
h(t) is even                    H(-f) = H(f)   [i.e., H(f) is even] 
h(t) is odd                     H(-f) = -H(f)  [i.e., H(f) is odd] 
h(t) is real and even           H(f) is real and even 
h(t) is real and odd            H(f) is imaginary and odd 
h(t) is imaginary and even      H(f) is imaginary and even 
h(t) is imaginary and odd       H(f) is real and odd 



In subsequent sections we shall see how to use these symmetries to increase 
computational efficiency. 



Here are some other elementary properties of the Fourier transform. (We'll use 
the “$\Longleftrightarrow$” symbol to indicate transform pairs.) If 

$$h(t) \Longleftrightarrow H(f)$$

is such a pair, then other transform pairs are 

$$h(at) \Longleftrightarrow \frac{1}{|a|} H\!\left(\frac{f}{a}\right) \qquad \text{“time scaling”} \tag{12.0.4}$$

$$\frac{1}{|b|} h\!\left(\frac{t}{b}\right) \Longleftrightarrow H(bf) \qquad \text{“frequency scaling”} \tag{12.0.5}$$

$$h(t - t_0) \Longleftrightarrow H(f) \, e^{2\pi i f t_0} \qquad \text{“time shifting”} \tag{12.0.6}$$

$$h(t) \, e^{-2\pi i f_0 t} \Longleftrightarrow H(f - f_0) \qquad \text{“frequency shifting”} \tag{12.0.7}$$

With two functions $h(t)$ and $g(t)$, and their corresponding Fourier transforms 
$H(f)$ and $G(f)$, we can form two combinations of special interest. The convolution 
of the two functions, denoted $g * h$, is defined by 

$$g * h \equiv \int_{-\infty}^{\infty} g(\tau) h(t - \tau) \, d\tau \tag{12.0.8}$$

Note that $g * h$ is a function in the time domain and that $g * h = h * g$. It turns out 
that the function $g * h$ is one member of a simple transform pair 

$$g * h \Longleftrightarrow G(f) H(f) \qquad \text{“Convolution Theorem”} \tag{12.0.9}$$


In other words, the Fourier transform of the convolution is just the product of the 
individual Fourier transforms. 
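
The theorem follows in a few lines from the definitions. Here is a sketch, assuming the sign convention of (12.0.1) used throughout this chapter, and assuming that the order of the two integrations may be interchanged:

```latex
\begin{aligned}
\int_{-\infty}^{\infty} (g * h)(t)\, e^{2\pi i f t}\, dt
  &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty}
     g(\tau)\, h(t - \tau)\, e^{2\pi i f t}\, d\tau\, dt \\
  &= \int_{-\infty}^{\infty} g(\tau)\, e^{2\pi i f \tau}
     \left[ \int_{-\infty}^{\infty} h(t - \tau)\, e^{2\pi i f (t - \tau)}\, dt \right] d\tau \\
  &= G(f)\, H(f)
\end{aligned}
```

The bracketed integral does not depend on $\tau$ after the change of variable $t - \tau \to t$, which is why it factors out as H(f).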

The correlation of two functions, denoted Corr(g, h), is defined by

$$ \mathrm{Corr}(g, h) \equiv \int_{-\infty}^{\infty} g(\tau + t)\, h(\tau)\, d\tau \tag{12.0.10} $$


The correlation is a function of t, which is called the lag. It therefore lies in the time 
domain, and it turns out to be one member of the transform pair: 


$$ \mathrm{Corr}(g, h) \Longleftrightarrow G(f)\, H^*(f) \qquad \text{“Correlation Theorem”} \tag{12.0.11} $$

[More generally, the second member of the pair is G(f)H(-f), but we are restricting
ourselves to the usual case in which g and h are real functions, so we take the liberty of
setting H(-f) = H*(f).] This result shows that multiplying the Fourier transform
of one function by the complex conjugate of the Fourier transform of the other gives
the Fourier transform of their correlation. The correlation of a function with itself is
called its autocorrelation. In this case (12.0.11) becomes the transform pair

$$ \mathrm{Corr}(g, g) \Longleftrightarrow |G(f)|^2 \qquad \text{“Wiener-Khinchin Theorem”} \tag{12.0.12} $$


The total power in a signal is the same whether we compute it in the time 
domain or in the frequency domain. This result is known as Parseval’s theorem:

$$ \text{Total Power} \equiv \int_{-\infty}^{\infty} |h(t)|^2\, dt = \int_{-\infty}^{\infty} |H(f)|^2\, df \tag{12.0.13} $$

Frequently one wants to know “how much power” is contained in the frequency
interval between f and f + df. In such circumstances one does not usually
distinguish between positive and negative f, but rather regards f as varying from 0
(“zero frequency” or D.C.) to $+\infty$. In such cases, one defines the one-sided power
spectral density (PSD) of the function h as

$$ P_h(f) \equiv |H(f)|^2 + |H(-f)|^2 \qquad 0 \le f < \infty \tag{12.0.14} $$

so that the total power is just the integral of $P_h(f)$ from f = 0 to f = $\infty$. When the
function h(t) is real, then the two terms in (12.0.14) are equal, so $P_h(f) = 2\,|H(f)|^2$.
Be warned that one occasionally sees PSDs defined without this factor two. These,
strictly speaking, are called two-sided power spectral densities, but some books
are not careful about stating whether one- or two-sided is to be assumed. We
will always use the one-sided density given by equation (12.0.14). Figure 12.0.1
contrasts the two conventions.

Figure 12.0.1. Normalizations of one- and two-sided power spectra. The area under the square of the
function, (a), equals the area under its one-sided power spectrum at positive frequencies, (b), and also
equals the area under its two-sided power spectrum at positive and negative frequencies, (c).

If the function h(t) goes endlessly from $-\infty < t < \infty$, then its total power
and power spectral density will, in general, be infinite. Of interest then is the (one-
or two-sided) power spectral density per unit time. This is computed by taking a
long, but finite, stretch of the function h(t), computing its PSD [that is, the PSD
of a function that equals h(t) in the finite stretch but is zero everywhere else], and
then dividing the resulting PSD by the length of the stretch used. Parseval’s theorem
in this case states that the integral of the one-sided PSD-per-unit-time over positive
frequency is equal to the mean square amplitude of the signal h(t).

You might well worry about how the PSD-per-unit-time, which is a function of
frequency f, converges as one evaluates it using longer and longer stretches of data.
This interesting question is the content of the subject of “power spectrum estimation,”
and will be considered below in §13.4-§13.7. A crude answer for now is: The
PSD-per-unit-time converges to finite values at all frequencies except those where
h(t) has a discrete sine-wave (or cosine-wave) component of finite amplitude. At
those frequencies, it becomes a delta-function, i.e., a sharp spike, whose width gets
narrower and narrower, but whose area converges to be the mean square amplitude
of the discrete sine or cosine component at that frequency.

We have by now stated all of the analytical formalism that we will need in this 
chapter with one exception: In computational work, especially with experimental 
data, we are almost never given a continuous function h(t) to work with, but are 
given, rather, a list of measurements of $h(t_i)$ for a discrete set of $t_i$’s. The profound
implications of this seemingly unimportant fact are the subject of the next section.


CITED REFERENCES AND FURTHER READING: 

Champeney, D.C. 1973, Fourier Transforms and Their Physical Applications (New York: Academic 
Press). 

Elliott, D.F., and Rao, K.R. 1982, Fast Transforms: Algorithms, Analyses, Applications (New York: 
Academic Press). 


12.1 Fourier Transform of Discretely Sampled Data



In the most common situations, function h(t) is sampled (i.e., its value is
recorded) at evenly spaced intervals in time. Let $\Delta$ denote the time interval between
consecutive samples, so that the sequence of sampled values is

$$ h_n = h(n\Delta) \qquad n = \ldots, -3, -2, -1, 0, 1, 2, 3, \ldots \tag{12.1.1} $$

The reciprocal of the time interval $\Delta$ is called the sampling rate; if $\Delta$ is measured
in seconds, for example, then the sampling rate is the number of samples recorded
per second.

Sampling Theorem and Aliasing 

For any sampling interval $\Delta$, there is also a special frequency $f_c$, called the
Nyquist critical frequency, given by

$$ f_c \equiv \frac{1}{2\Delta} \tag{12.1.2} $$

If a sine wave of the Nyquist critical frequency is sampled at its positive peak value,
then the next sample will be at its negative trough value, the sample after that at
the positive peak again, and so on. Expressed otherwise: Critical sampling of a
sine wave is two sample points per cycle. One frequently chooses to measure time
in units of the sampling interval $\Delta$. In this case the Nyquist critical frequency is
just the constant 1/2.

The Nyquist critical frequency is important for two related, but distinct, reasons.
One is good news, and the other bad news. First the good news. It is the remarkable
fact known as the sampling theorem: If a continuous function h(t), sampled at an
interval $\Delta$, happens to be bandwidth limited to frequencies smaller in magnitude than
$f_c$, i.e., if H(f) = 0 for all $|f| \ge f_c$, then the function h(t) is completely determined
by its samples $h_n$. In fact, h(t) is given explicitly by the formula

$$ h(t) = \Delta \sum_{n=-\infty}^{+\infty} h_n\, \frac{\sin[2\pi f_c (t - n\Delta)]}{\pi (t - n\Delta)} \tag{12.1.3} $$

This is a remarkable theorem for many reasons, among them that it shows that the
“information content” of a bandwidth limited function is, in some sense, infinitely
smaller than that of a general continuous function. Fairly often, one is dealing
with a signal that is known on physical grounds to be bandwidth limited (or at
least approximately bandwidth limited). For example, the signal may have passed
through an amplifier with a known, finite frequency response. In this case, the
sampling theorem tells us that the entire information content of the signal can be
recorded by sampling it at a rate $\Delta^{-1}$ equal to twice the maximum frequency passed
by the amplifier (cf. 12.1.2).

Now the bad news. The bad news concerns the effect of sampling a continuous
function that is not bandwidth limited to less than the Nyquist critical frequency.
In that case, it turns out that all of the power spectral density that lies outside of
the frequency range $-f_c < f < f_c$ is spuriously moved into that range. This
phenomenon is called aliasing. Any frequency component outside of the frequency
range $(-f_c, f_c)$ is aliased (falsely translated) into that range by the very act of
discrete sampling. You can readily convince yourself that two waves $\exp(2\pi i f_1 t)$
and $\exp(2\pi i f_2 t)$ give the same samples at an interval $\Delta$ if and only if $f_1$ and
$f_2$ differ by a multiple of $1/\Delta$, which is just the width in frequency of the range
$(-f_c, f_c)$. There is little that you can do to remove aliased power once you have
discretely sampled a signal. The way to overcome aliasing is to (i) know the natural
bandwidth limit of the signal, or else enforce a known limit by analog filtering
of the continuous signal, and then (ii) sample at a rate sufficiently rapid to give at
least two points per cycle of the highest frequency present. Figure 12.1.1 illustrates
these considerations.

To put the best face on this, we can take the alternative point of view: If a
continuous function has been competently sampled, then, when we come to estimate
its Fourier transform from the discrete samples, we can assume (or rather we might
as well assume) that its Fourier transform is equal to zero outside of the frequency
range in between $-f_c$ and $f_c$. Then we look to the Fourier transform to tell whether
the continuous function has been competently sampled (aliasing effects minimized).
We do this by looking to see whether the Fourier transform is already approaching
zero as the frequency approaches $f_c$ from below, or $-f_c$ from above. If, on the
contrary, the transform is going towards some finite value, then chances are that
components outside of the range have been folded back over onto the critical range.


Discrete Fourier Transform 


We now estimate the Fourier transform of a function from a finite number of its
sampled points. Suppose that we have N consecutive sampled values

$$ h_k \equiv h(t_k), \qquad t_k \equiv k\Delta, \qquad k = 0, 1, 2, \ldots, N-1 \tag{12.1.4} $$

so that the sampling interval is $\Delta$. To make things simpler, let us also suppose that
N is even. If the function h(t) is nonzero only in a finite interval of time, then
that whole interval of time is supposed to be contained in the range of the N points
given. Alternatively, if the function h(t) goes on forever, then the sampled points are
supposed to be at least “typical” of what h(t) looks like at all other times.

Figure 12.1.1. The continuous function shown in (a) is nonzero only for a finite interval of time T.
It follows that its Fourier transform, whose modulus is shown schematically in (b), is not bandwidth
limited but has finite amplitude for all frequencies. If the original function is sampled with a sampling
interval $\Delta$, as in (a), then the Fourier transform (c) is defined only between plus and minus the Nyquist
critical frequency. Power outside that range is folded over or “aliased” into the range. The effect can be
eliminated only by low-pass filtering the original function before sampling.

With N numbers of input, we will evidently be able to produce no more than
N independent numbers of output. So, instead of trying to estimate the Fourier
transform H(f) at all values of f in the range $-f_c$ to $f_c$, let us seek estimates
only at the discrete values

$$ f_n \equiv \frac{n}{N\Delta}, \qquad n = -\frac{N}{2}, \ldots, \frac{N}{2} \tag{12.1.5} $$

The extreme values of n in (12.1.5) correspond exactly to the lower and upper limits
of the Nyquist critical frequency range. If you are really on the ball, you will have
noticed that there are N + 1, not N, values of n in (12.1.5); it will turn out that
the two extreme values of n are not independent (in fact they are equal), but all the
others are. This reduces the count to N.

The remaining step is to approximate the integral in (12.0.1) by a discrete sum:

$$ H(f_n) = \int_{-\infty}^{\infty} h(t)\, e^{2\pi i f_n t}\, dt \approx \sum_{k=0}^{N-1} h_k\, e^{2\pi i f_n t_k}\, \Delta = \Delta \sum_{k=0}^{N-1} h_k\, e^{2\pi i k n / N} \tag{12.1.6} $$

Here equations (12.1.4) and (12.1.5) have been used in the final equality. The final
summation in equation (12.1.6) is called the discrete Fourier transform of the N
points $h_k$. Let us denote it by $H_n$,

$$ H_n \equiv \sum_{k=0}^{N-1} h_k\, e^{2\pi i k n / N} \tag{12.1.7} $$


The discrete Fourier transform maps N complex numbers (the $h_k$’s) into N complex
numbers (the $H_n$’s). It does not depend on any dimensional parameter, such as the
time scale $\Delta$. The relation (12.1.6) between the discrete Fourier transform of a set
of numbers and their continuous Fourier transform when they are viewed as samples
of a continuous function sampled at an interval $\Delta$ can be rewritten as

$$ H(f_n) \approx \Delta H_n \tag{12.1.8} $$

where $f_n$ is given by (12.1.5).

Up to now we have taken the view that the index n in (12.1.7) varies from -N/2
to N/2 (cf. 12.1.5). You can easily see, however, that (12.1.7) is periodic in n, with
period N. Therefore, $H_{-n} = H_{N-n}$, n = 1, 2, .... With this conversion in mind,
one generally lets the n in $H_n$ vary from 0 to N - 1 (one complete period). Then n
and k (in $h_k$) vary exactly over the same range, so the mapping of N numbers into
N numbers is manifest. When this convention is followed, you must remember that
zero frequency corresponds to n = 0, positive frequencies $0 < f < f_c$ correspond
to values $1 \le n \le N/2 - 1$, while negative frequencies $-f_c < f < 0$ correspond to
$N/2 + 1 \le n \le N - 1$. The value n = N/2 corresponds to both $f = f_c$ and $f = -f_c$.
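
This bookkeeping is easy to get wrong, so it can be worth wrapping in a function. The one below is a sketch with a name of our own choosing (index_to_frequency); it assumes N even and a zero-based index n, and it arbitrarily reports the ambiguous point n = N/2 as the positive critical frequency.

```c
/* Hypothetical helper: frequency f_n = n/(N*delta) assigned to output index n
   (0 <= n <= N-1) under the storage convention described above. Index N/2 is
   returned as +fc, although it equally represents -fc. N is assumed even. */
double index_to_frequency(unsigned long n, unsigned long N, double delta)
{
    if (n <= N / 2)
        return (double)n / ((double)N * delta);
    return -(double)(N - n) / ((double)N * delta);
}
```

For N = 8 and unit sampling interval, the indices 0, 1, 2, ..., 7 map to the frequencies 0, 1/8, 2/8, 3/8, 1/2, -3/8, -2/8, -1/8.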

The discrete Fourier transform has symmetry properties almost exactly the same
as the continuous Fourier transform. For example, all the symmetries in the table
following equation (12.0.3) hold if we read $h_k$ for h(t), $H_n$ for H(f), and $H_{N-n}$
for H(-f). (Likewise, “even” and “odd” in time refer to whether the values $h_k$ at k
and N - k are identical or the negative of each other.)

The formula for the discrete inverse Fourier transform, which recovers the set
of $h_k$’s exactly from the $H_n$’s is:

$$ h_k = \frac{1}{N} \sum_{n=0}^{N-1} H_n\, e^{-2\pi i k n / N} \tag{12.1.9} $$

Notice that the only differences between (12.1.9) and (12.1.7) are (i) changing the 
sign in the exponential, and (ii) dividing the answer by N. This means that a 
routine for calculating discrete Fourier transforms can also, with slight modification, 
calculate the inverse transforms.

The discrete form of Parseval’s theorem is

$$ \sum_{k=0}^{N-1} |h_k|^2 = \frac{1}{N} \sum_{n=0}^{N-1} |H_n|^2 \tag{12.1.10} $$

There are also discrete analogs to the convolution and correlation theorems (equations
12.0.9 and 12.0.11), but we shall defer them to §13.1 and §13.2, respectively.

CITED REFERENCES AND FURTHER READING: 

Brigham, E.O. 1974, The Fast Fourier Transform (Englewood Cliffs, NJ: Prentice-Hall). 

Elliott, D.F., and Rao, K.R. 1982, Fast Transforms: Algorithms, Analyses, Applications (New York: 
Academic Press). 


12.2 Fast Fourier Transform (FFT) 

How much computation is involved in computing the discrete Fourier transform
(12.1.7) of N points? For many years, until the mid-1960s, the standard answer
was this: Define W as the complex number

$$ W \equiv e^{2\pi i / N} \tag{12.2.1} $$

Then (12.1.7) can be written as

$$ H_n = \sum_{k=0}^{N-1} W^{nk}\, h_k \tag{12.2.2} $$

In other words, the vector of $h_k$’s is multiplied by a matrix whose (n, k)th element
is the constant W to the power n × k. The matrix multiplication produces a vector
result whose components are the $H_n$’s. This matrix multiplication evidently requires
$N^2$ complex multiplications, plus a smaller number of operations to generate the
required powers of W. So, the discrete Fourier transform appears to be an $O(N^2)$
process. These appearances are deceiving! The discrete Fourier transform can,
in fact, be computed in $O(N \log_2 N)$ operations with an algorithm called the fast
Fourier transform, or FFT. The difference between $N \log_2 N$ and $N^2$ is immense.
With $N = 10^6$, for example, it is the difference between, roughly, 30 seconds of CPU
time and 2 weeks of CPU time on a microsecond cycle time computer. The existence
of an FFT algorithm became generally known only in the mid-1960s, from the work
of J.W. Cooley and J.W. Tukey. Retrospectively, we now know (see [1]) that efficient
methods for computing the DFT had been independently discovered, and in some
cases implemented, by as many as a dozen individuals, starting with Gauss in 1805!
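
The arithmetic behind those two figures is worth a glance. As a rough check, assuming one operation per microsecond cycle and ignoring constant factors of order unity:

```latex
\begin{aligned}
N \log_2 N &= 10^6 \times \log_2 10^6 \approx 10^6 \times 19.9
            \approx 2 \times 10^7 \ \text{operations} \approx 20\ \text{s} \\
N^2        &= 10^{12}\ \text{operations} \approx 10^6\ \text{s} \approx 11.6\ \text{days}
\end{aligned}
```

These agree with the quoted “30 seconds” and “2 weeks” to within the factor-of-order-unity accuracy that such an estimate deserves.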

One “rediscovery” of the FFT, that of Danielson and Lanczos in 1942, provides
one of the clearest derivations of the algorithm. Danielson and Lanczos showed
that a discrete Fourier transform of length N can be rewritten as the sum of two
discrete Fourier transforms, each of length N/2. One of the two is formed from the
even-numbered points of the original N, the other from the odd-numbered points.
The proof is simply this:

$$
\begin{aligned}
F_k &= \sum_{j=0}^{N-1} e^{2\pi i j k / N}\, f_j \\
    &= \sum_{j=0}^{N/2-1} e^{2\pi i k (2j) / N}\, f_{2j}
     + \sum_{j=0}^{N/2-1} e^{2\pi i k (2j+1) / N}\, f_{2j+1} \\
    &= \sum_{j=0}^{N/2-1} e^{2\pi i k j / (N/2)}\, f_{2j}
     + W^k \sum_{j=0}^{N/2-1} e^{2\pi i k j / (N/2)}\, f_{2j+1} \\
    &= F_k^e + W^k F_k^o
\end{aligned}
\tag{12.2.3}
$$

In the last line, W is the same complex constant as in (12.2.1), $F_k^e$ denotes the kth
component of the Fourier transform of length N/2 formed from the even components
of the original $f_j$’s, while $F_k^o$ is the corresponding transform of length N/2 formed
from the odd components. Notice also that k in the last line of (12.2.3) varies from
0 to N, not just to N/2. Nevertheless, the transforms $F_k^e$ and $F_k^o$ are periodic in k
with length N/2. So each is repeated through two cycles to obtain $F_k$.

The wonderful thing about the Danielson-Lanczos Lemma is that it can be used
recursively. Having reduced the problem of computing $F_k$ to that of computing
$F_k^e$ and $F_k^o$, we can do the same reduction of $F_k^e$ to the problem of computing
the transform of its N/4 even-numbered input data and N/4 odd-numbered data.
In other words, we can define $F_k^{ee}$ and $F_k^{eo}$ to be the discrete Fourier transforms
of the points which are respectively even-even and even-odd on the successive
subdivisions of the data.

Although there are ways of treating other cases, by far the easiest case is the 
one in which the original N is an integer power of 2. In fact, we categorically 
recommend that you only use FFTs with N a power of two. If the length of your data 
set is not a power of two, pad it with zeros up to the next power of two. (We will give 
more sophisticated suggestions in subsequent sections below.) With this restriction 
on N, it is evident that we can continue applying the Danielson-Lanczos Lemma 
until we have subdivided the data all the way down to transforms of length 1. What 
is the Fourier transform of length one? It is just the identity operation that copies its 
one input number into its one output slot! In other words, for every pattern of $\log_2 N$
e’s and o’s, there is a one-point transform that is just one of the input numbers $f_n$:

$$ F_k^{eoeeoeo \cdots oee} = f_n \qquad \text{for some } n \tag{12.2.4} $$

(Of course this one-point transform actually does not depend on k, since it is periodic 
in k with period 1.) 

The next trick is to figure out which value of n corresponds to which pattern of
e’s and o’s in equation (12.2.4). The answer is: Reverse the pattern of e’s and o’s,
then let e = 0 and o = 1, and you will have, in binary, the value of n. Do you see
why it works? It is because the successive subdivisions of the data into even and odd
are tests of successive low-order (least significant) bits of n. This idea of bit reversal
can be exploited in a very clever way which, along with the Danielson-Lanczos
Lemma, makes FFTs practical: Suppose we take the original vector of data $f_j$
and rearrange it into bit-reversed order (see Figure 12.2.1), so that the individual
numbers are in the order not of j, but of the number obtained by bit-reversing j.
Then the bookkeeping on the recursive application of the Danielson-Lanczos Lemma
becomes extraordinarily simple. The points as given are the one-point transforms.
We combine adjacent pairs to get two-point transforms, then combine adjacent pairs
of pairs to get 4-point transforms, and so on, until the first and second halves of
the whole data set are combined into the final transform. Each combination takes
of order N operations, and there are evidently $\log_2 N$ combinations, so the whole
algorithm is of order $N \log_2 N$ (assuming, as is the case, that the process of sorting
into bit-reversed order is no greater in order than $N \log_2 N$).

Figure 12.2.1. Reordering an array (here of length 8) by bit reversal, (a) between two arrays, versus (b)
in place. Bit reversal reordering is a necessary part of the fast Fourier transform (FFT) algorithm.
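
Bit reversal itself takes only a few lines. The sketch below uses zero-based indexing and function names of our own (bit_reverse, bit_reverse_permute); it reverses each index bit by bit, which is simple to read though not the fastest way to generate the sequence.

```c
#include <stddef.h>

/* Reverse the lowest `bits` bits of j. */
size_t bit_reverse(size_t j, unsigned bits)
{
    size_t r = 0;
    for (unsigned b = 0; b < bits; b++) {
        r = (r << 1) | (j & 1);      /* move the low bit of j onto the end of r */
        j >>= 1;
    }
    return r;
}

/* Reorder a[0..n-1] into bit-reversed order in place; n must equal 2^bits. */
void bit_reverse_permute(double a[], size_t n, unsigned bits)
{
    for (size_t j = 0; j < n; j++) {
        size_t r = bit_reverse(j, bits);
        if (r > j) {                 /* swap each pair only once */
            double tmp = a[j];
            a[j] = a[r];
            a[r] = tmp;
        }
    }
}
```

For n = 8 the indices 0 through 7 end up in the order 0, 4, 2, 6, 1, 5, 3, 7. Because bit reversal is its own inverse, the elements pair off, and the test r > j ensures that no pair is swapped twice.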

This, then, is the structure of an FFT algorithm: It has two sections. The first
section sorts the data into bit-reversed order. Luckily this takes no additional storage,
since it involves only swapping pairs of elements. (If $k_1$ is the bit reverse of $k_2$, then
$k_2$ is the bit reverse of $k_1$.) The second section has an outer loop that is executed
$\log_2 N$ times and calculates, in turn, transforms of length 2, 4, 8, ..., N. For each
stage of this process, two nested inner loops range over the subtransforms already
computed and the elements of each transform, implementing the Danielson-Lanczos
Lemma. The operation is made more efficient by restricting external calls for
trigonometric sines and cosines to the outer loop, where they are made only $\log_2 N$
times. Computation of the sines and cosines of multiple angles is through simple
recurrence relations in the inner loops (cf. 5.5.6).

The FFT routine given below is based on one originally written by N. M.
Brenner. The input quantities are the number of complex data points (nn), the data
array (data[1..2*nn]), and isign, which should be set to either ±1 and is the sign
of i in the exponential of equation (12.1.7). When isign is set to -1, the routine
thus calculates the inverse transform (12.1.9), except that it does not multiply by
the normalizing factor 1/N that appears in that equation. You can do that yourself.

Notice that the argument nn is the number of complex data points. The actual
length of the real array (data[1..2*nn]) is 2 times nn, with each complex value
occupying two consecutive locations. In other words, data[1] is the real part of
$f_0$, data[2] is the imaginary part of $f_0$, and so on up to data[2*nn-1], which
is the real part of $f_{N-1}$, and data[2*nn], which is the imaginary part of $f_{N-1}$.
The FFT routine gives back the $F_n$’s packed in exactly the same fashion, as nn
complex numbers.

The real and imaginary parts of the zero frequency component $F_0$ are in data[1]
and data[2]; the smallest nonzero positive frequency has real and imaginary parts in
data[3] and data[4]; the smallest (in magnitude) nonzero negative frequency has
real and imaginary parts in data[2*nn-1] and data[2*nn]. Positive frequencies
increasing in magnitude are stored in the real-imaginary pairs data[5], data[6]
up to data[nn-1], data[nn]. Negative frequencies of increasing magnitude are
stored in data[2*nn-3], data[2*nn-2] down to data[nn+3], data[nn+4].
Finally, the pair data[nn+1], data[nn+2] contain the real and imaginary parts of
the one aliased point that contains the most positive and the most negative frequency.
You should try to develop a familiarity with this storage arrangement of complex
spectra, also shown in Figure 12.2.2, since it is the practical standard.

This is a good place to remind you that you can also use a routine like four1
without modification even if your input data array is zero-offset, that is, has the range
data[0..2*nn-1]. In this case, simply decrement the pointer to data by one when
four1 is invoked, e.g., four1(data-1,1024,1);. The real part of $f_0$ will now be
returned in data[0], the imaginary part in data[1], and so on. See §1.2.


#include <math.h>
#define SWAP(a,b) tempr=(a);(a)=(b);(b)=tempr

void four1(float data[], unsigned long nn, int isign)
/* Replaces data[1..2*nn] by its discrete Fourier transform, if isign is input as 1; or replaces
   data[1..2*nn] by nn times its inverse discrete Fourier transform, if isign is input as -1.
   data is a complex array of length nn or, equivalently, a real array of length 2*nn. nn MUST
   be an integer power of 2 (this is not checked for!). */
{
    unsigned long n,mmax,m,j,istep,i;
    double wtemp,wr,wpr,wpi,wi,theta;    /* Double precision for the trigonometric recurrences. */
    float tempr,tempi;

    n=nn << 1;
    j=1;
    for (i=1;i<n;i+=2) {                 /* This is the bit-reversal section of the routine. */
        if (j > i) {
            SWAP(data[j],data[i]);       /* Exchange the two complex numbers. */
            SWAP(data[j+1],data[i+1]);
        }
        m=nn;
        while (m >= 2 && j > m) {
            j -= m;
            m >>= 1;
        }
        j += m;
    }
    /* Here begins the Danielson-Lanczos section of the routine. */
    mmax=2;
    while (n > mmax) {                   /* Outer loop executed log2 nn times. */
        istep=mmax << 1;
        theta=isign*(6.28318530717959/mmax);    /* Initialize the trigonometric recurrence. */
        wtemp=sin(0.5*theta);
        wpr = -2.0*wtemp*wtemp;
        wpi=sin(theta);
        wr=1.0;
        wi=0.0;
        for (m=1;m<mmax;m+=2) {          /* Here are the two nested inner loops. */
            for (i=m;i<=n;i+=istep) {
                j=i+mmax;                /* This is the Danielson-Lanczos formula: */
                tempr=wr*data[j]-wi*data[j+1];
                tempi=wr*data[j+1]+wi*data[j];
                data[j]=data[i]-tempr;
                data[j+1]=data[i+1]-tempi;
                data[i] += tempr;
                data[i+1] += tempi;
            }
            wr=(wtemp=wr)*wpr-wi*wpi+wr; /* Trigonometric recurrence. */
            wi=wi*wpr+wtemp*wpi+wi;
        }
        mmax=istep;
    }
}

Figure 12.2.2. Input and output arrays for FFT. (a) The input array contains N (a power of 2)
complex time samples in a real array of length 2N, with real and imaginary parts alternating. (b) The
output array contains the complex Fourier spectrum at N values of frequency. Real and imaginary parts
again alternate. The array starts with zero frequency, works up to the most positive frequency (which
is ambiguous with the most negative frequency). Negative frequencies follow, from the second-most
negative up to the frequency just below zero.



(A double precision version of four1, named dfour1, is used by the routine mpmul
in §20.6. You can easily make the conversion, or else get the converted routine
from the Numerical Recipes diskette.)


Other FFT Algorithms

We should mention that there are a number of variants on the basic FFT algorithm 
given above. As we have seen, that algorithm first rearranges the input elements 
into bit-reverse order, then builds up the output transform in $\log_2 N$ iterations. In
the literature, this sequence is called a decimation-in-time or Cooley-Tukey FFT
algorithm. It is also possible to derive FFT algorithms that first go through a set of
$\log_2 N$ iterations on the input data, and rearrange the output values into bit-reverse
order. These are called decimation-in-frequency or Sande-Tukey FFT algorithms. For
some applications, such as convolution (§13.1), one takes a data set into the Fourier 
domain and then, after some manipulation, back out again. In these cases it is possible 
to avoid all bit reversing. You use a decimation-in-frequency algorithm (without its 
bit reversing) to get into the “scrambled” Fourier domain, do your operations there, 
and then use an inverse algorithm (without its bit reversing) to get back to the time 
domain. While elegant in principle, this procedure does not in practice save much 
computation time, since the bit reversals represent only a small fraction of an FFT’s 
operations count, and since most useful operations in the frequency domain require 
a knowledge of which points correspond to which frequencies. 

Another class of FFTs subdivides the initial data set of length N not all the 
way down to the trivial transform of length 1, but rather only down to some other 
small power of 2, for example N = 4, base-4 FFTs, or N = 8, base-8 FFTs. These 
small transforms are then done by small sections of highly optimized coding which 
take advantage of special symmetries of that particular small N. For example, for 
N = 4, the trigonometric sines and cosines that enter are all ±1 or 0, so many 
multiplications are eliminated, leaving largely additions and subtractions. These 
can be faster than simpler FFTs by some significant, but not overwhelming, factor, 
e.g., 20 or 30 percent. 
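
To see why N = 4 is so favorable, write the four-point transform out by hand. With the sign convention of (12.1.7), W = exp(2πi/4) = i, and multiplying by ±i merely swaps real and imaginary parts with a change of sign. The sketch below (the name dft4 and the split into separate real and imaginary arrays are our own choices for illustration) therefore contains no multiplications at all, only sixteen additions and subtractions.

```c
/* Four-point transform with the sign convention of (12.1.7), acting on
   separate real and imaginary parts. Uses only additions and subtractions. */
void dft4(const double re[4], const double im[4], double Re[4], double Im[4])
{
    /* Sums and differences of points two apart. */
    double s02r = re[0] + re[2], s02i = im[0] + im[2];
    double d02r = re[0] - re[2], d02i = im[0] - im[2];
    double s13r = re[1] + re[3], s13i = im[1] + im[3];
    double d13r = re[1] - re[3], d13i = im[1] - im[3];

    Re[0] = s02r + s13r;  Im[0] = s02i + s13i;   /* H0 = (h0+h2) + (h1+h3)  */
    Re[2] = s02r - s13r;  Im[2] = s02i - s13i;   /* H2 = (h0+h2) - (h1+h3)  */
    /* Multiplying a + ib by i gives -b + ia. */
    Re[1] = d02r - d13i;  Im[1] = d02i + d13r;   /* H1 = (h0-h2) + i(h1-h3) */
    Re[3] = d02r + d13i;  Im[3] = d02i - d13r;   /* H3 = (h0-h2) - i(h1-h3) */
}
```

A base-4 FFT uses a kernel of this kind at its lowest level in place of two stages of two-point butterflies.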

There are also FFT algorithms for data sets of length N not a power of two. They 
work by using relations analogous to the Danielson-Lanczos Lemma to subdivide 
the initial problem into successively smaller problems, not by factors of 2, but by 
whatever small prime factors happen to divide N. The larger that the largest prime 
factor of N is, the worse this method works. If N is prime, then no subdivision 
is possible, and the user (whether he knows it or not) is taking a slow Fourier 
transform, of order $N^2$ instead of order $N \log_2 N$. Our advice is to stay clear
of such FFT implementations, with perhaps one class of exceptions, the Winograd
Fourier transform algorithms. Winograd algorithms are in some ways analogous to
the base-4 and base-8 FFTs. Winograd has derived highly optimized codings for
taking small-N discrete Fourier transforms, e.g., for N = 2, 3, 4, 5, 7, 8, 11, 13, 16.
The algorithms also use a new and clever way of combining the subfactors. The 
method involves a reordering of the data both before the hierarchical processing and 
after it, but it allows a significant reduction in the number of multiplications in the 
algorithm. For some especially favorable values of N, the Winograd algorithms can 
be significantly (e.g., up to a factor of 2) faster than the simpler FFT algorithms 
of the nearest integer power of 2. This advantage in speed, however, must be 
weighed against the considerably more complicated data indexing involved in these 
transforms, and the fact that the Winograd transform cannot be done “in place.” 

Finally, an interesting class of transforms for doing convolutions quickly are 
number theoretic transforms. These schemes replace floating-point arithmetic with
integer arithmetic modulo some large prime N + 1, and the Nth root of 1 by the
modulo arithmetic equivalent. Strictly speaking, these are not Fourier transforms
at all, but the properties are quite similar and computational speed can be far
superior. On the other hand, their use is somewhat restricted to quantities like
correlations and convolutions since the transform itself is not easily interpretable
as a “frequency” spectrum.

CITED REFERENCES AND FURTHER READING:

Bloomfield, P. 1976, Fourier Analysis of Time Series - An Introduction (New York: Wiley).

Van Loan, C. 1992, Computational Frameworks for the Fast Fourier Transform (Philadelphia: 
S.I.A.M.). 

Beauchamp, K.G. 1984, Applications of Walsh Functions and Related Functions (New York: 
Academic Press) [non-Fourier transforms]. 

Heideman, M.T., Johnson, D.H., and Burrus, C.S. 1984, IEEE ASSP Magazine, pp. 14-21 (October). [1]


12.3 FFT of Real Functions, Sine and Cosine Transforms


It happens frequently that the data whose FFT is desired consist of real-valued
samples $f_j$, j = 0 ... N - 1. To use four1, we put these into a complex array
with all imaginary parts set to zero. The resulting transform $F_n$, n = 0 ... N - 1
satisfies $F_{N-n}^* = F_n$. Since this complex-valued array has real values for $F_0$
and $F_{N/2}$, and (N/2) - 1 other independent values $F_1 \ldots F_{N/2-1}$, it has the same
2(N/2 - 1) + 2 = N “degrees of freedom” as the original, real data set. However,
the use of the full complex FFT algorithm for real data is inefficient, both in execution
time and in storage required. You would think that there is a better way.

There are two better ways. The first is “mass production”: Pack two separate 
real functions into the input array in such a way that their individual transforms can 
be separated from the result. This is implemented in the program twofft below. 
This may remind you of a one-cent sale, at which you are coerced to purchase 
two of an item when you only need one. However, remember that for correlations 
and convolutions the Fourier transforms of two functions are involved, and this is a 
handy way to do them both at once. The second method is to pack the real input 
array cleverly, without extra zeros, into a complex array of half its length. One then 
performs a complex FFT on this shorter length; the trick is then to get the required 
answer out of the result. This is done in the program realft below.


Transform of Two Real Functions Simultaneously


First we show how to exploit the symmetry of the transform $F_n$ to handle
two real functions at once: Since the input data $f_j$ are real, the components of the
discrete Fourier transform satisfy

$$F_{N-n} = (F_n)^* \tag{12.3.1}$$

where the asterisk denotes complex conjugation. By the same token, the discrete
Fourier transform of a purely imaginary set of $g_j$'s has the opposite symmetry,

$$G_{N-n} = -(G_n)^* \tag{12.3.2}$$

Therefore we can take the discrete Fourier transform of two real functions each of
length $N$ simultaneously by packing the two data arrays as the real and imaginary
parts, respectively, of the complex input array of four1. Then the resulting transform
array can be unpacked into two complex arrays with the aid of the two symmetries.
Routine twofft works out these ideas.


void twofft(float data1[], float data2[], float fft1[], float fft2[],
    unsigned long n)
/* Given two real input arrays data1[1..n] and data2[1..n], this routine calls four1 and
returns two complex output arrays, fft1[1..2n] and fft2[1..2n], each of complex length
n (i.e., real length 2*n), which contain the discrete Fourier transforms of the respective data
arrays. n MUST be an integer power of 2. */
{
    void four1(float data[], unsigned long nn, int isign);
    unsigned long nn3,nn2,jj,j;
    float rep,rem,aip,aim;

    nn3=1+(nn2=2+n+n);
    for (j=1,jj=2;j<=n;j++,jj+=2) {      /* Pack the two real arrays into one complex array. */
        fft1[jj-1]=data1[j];
        fft1[jj]=data2[j];
    }
    four1(fft1,n,1);                     /* Transform the complex array. */
    fft2[1]=fft1[2];
    fft1[2]=fft2[2]=0.0;
    for (j=3;j<=n+1;j+=2) {
        rep=0.5*(fft1[j]+fft1[nn2-j]);   /* Use symmetries to separate the two transforms. */
        rem=0.5*(fft1[j]-fft1[nn2-j]);
        aip=0.5*(fft1[j+1]+fft1[nn3-j]);
        aim=0.5*(fft1[j+1]-fft1[nn3-j]);
        fft1[j]=rep;                     /* Ship them out in two complex arrays. */
        fft1[j+1]=aim;
        fft1[nn2-j]=rep;
        fft1[nn3-j] = -aim;
        fft2[j]=aip;
        fft2[j+1] = -rem;
        fft2[nn2-j]=aip;
        fft2[nn3-j]=rem;
    }
}

What about the reverse process? Suppose you have two complex transform
arrays, each of which has the symmetry (12.3.1), so that you know that the inverses
of both transforms are real functions. Can you invert both in a single FFT? This is
even easier than the other direction. Use the fact that the FFT is linear and form
the sum of the first transform plus $i$ times the second. Invert using four1 with
isign = −1. The real and imaginary parts of the resulting complex array are the
two desired real functions.

FFT of Single Real Function 

To implement the second method, which allows us to perform the FFT of 
a single real function without redundancy, we split the data set in half, thereby 
forming two real arrays of half the size. We can apply the program above to these 
two, but of course the result will not be the transform of the original data. It will 
be a schizophrenic combination of two transforms, each of which has half of the 
information we need. Fortunately, this schizophrenia is treatable. It works like this: 

The right way to split the original data is to take the even-numbered $f_j$ as
one data set, and the odd-numbered $f_j$ as the other. The beauty of this is that
we can take the original real array and treat it as a complex array $h_j$ of half the
length. The first data set is the real part of this array, and the second is the
imaginary part, as prescribed for twofft. No repacking is required. In other words
$h_j = f_{2j} + i f_{2j+1}$, $j = 0, \ldots, N/2-1$. We submit this to four1, and it will give
back a complex array $H_n = F_n^e + i F_n^o$, $n = 0, \ldots, N/2-1$ with

$$\begin{aligned}
F_n^e &= \sum_{k=0}^{N/2-1} f_{2k}\, e^{2\pi i k n/(N/2)} \\
F_n^o &= \sum_{k=0}^{N/2-1} f_{2k+1}\, e^{2\pi i k n/(N/2)}
\end{aligned} \tag{12.3.3}$$

The discussion of program twofft tells you how to separate the two transforms
$F_n^e$ and $F_n^o$ out of $H_n$. How do you work them into the transform $F_n$ of the original
data set $f_j$? Simply glance back at equation (12.2.3):

$$F_n = F_n^e + e^{2\pi i n/N} F_n^o \qquad n = 0, \ldots, N-1 \tag{12.3.4}$$

Expressed directly in terms of the transform $H_n$ of our real (masquerading as
complex) data set, the result is

$$F_n = \tfrac{1}{2}\left(H_n + H_{N/2-n}^*\right) - \tfrac{i}{2}\left(H_n - H_{N/2-n}^*\right) e^{2\pi i n/N} \qquad n = 0, \ldots, N-1 \tag{12.3.5}$$



A few remarks: 

• Since $F_{N-n}^* = F_n$ there is no point in saving the entire spectrum. The
positive frequency half is sufficient and can be stored in the same array as
the original data. The operation can, in fact, be done in place.

• Even so, we need values $H_n$, $n = 0, \ldots, N/2$ whereas four1 gives only
the values $n = 0, \ldots, N/2-1$. Symmetry to the rescue, $H_{N/2} = H_0$.

• The values $F_0$ and $F_{N/2}$ are real and independent. In order to actually
get the entire $F_n$ in the original array space, it is convenient to put $F_{N/2}$
into the imaginary part of $F_0$.

• Despite its complicated form, the process above is invertible. First peel
$F_{N/2}$ out of $F_0$. Then construct

$$\begin{aligned}
F_n^{(1)} &= \tfrac{1}{2}\left(F_n + F_{N/2-n}^*\right) \\
F_n^{(2)} &= \tfrac{1}{2}\, e^{-2\pi i n/N}\left(F_n - F_{N/2-n}^*\right)
\end{aligned} \qquad n = 0, \ldots, N/2-1 \tag{12.3.6}$$

and use four1 to find the inverse transform of $H_n = F_n^{(1)} + i F_n^{(2)}$.
Surprisingly, the actual algebraic steps are virtually identical to those of
the forward transform.

Here is a representation of what we have said:


#include <math.h>

void realft(float data[], unsigned long n, int isign)
/* Calculates the Fourier transform of a set of n real-valued data points. Replaces this data (which
is stored in array data[1..n]) by the positive frequency half of its complex Fourier transform.
The real-valued first and last components of the complex transform are returned as elements
data[1] and data[2], respectively. n must be a power of 2. This routine also calculates the
inverse transform of a complex data array if it is the transform of real data. (Result in this case
must be multiplied by 2/n.) */
{
    void four1(float data[], unsigned long nn, int isign);
    unsigned long i,i1,i2,i3,i4,np3;
    float c1=0.5,c2,h1r,h1i,h2r,h2i;
    double wr,wi,wpr,wpi,wtemp,theta;        /* Double precision for the trigonometric recurrences. */

    theta=3.141592653589793/(double) (n>>1); /* Initialize the recurrence. */
    if (isign == 1) {
        c2 = -0.5;
        four1(data,n>>1,1);                  /* The forward transform is here. */
    } else {
        c2=0.5;                              /* Otherwise set up for an inverse transform. */
        theta = -theta;
    }
    wtemp=sin(0.5*theta);
    wpr = -2.0*wtemp*wtemp;
    wpi=sin(theta);
    wr=1.0+wpr;
    wi=wpi;
    np3=n+3;
    for (i=2;i<=(n>>2);i++) {                /* Case i=1 done separately below. */
        i4=1+(i3=np3-(i2=1+(i1=i+i-1)));
        h1r=c1*(data[i1]+data[i3]);          /* The two separate transforms are separated out of data. */
        h1i=c1*(data[i2]-data[i4]);
        h2r = -c2*(data[i2]+data[i4]);
        h2i=c2*(data[i1]-data[i3]);
        data[i1]=h1r+wr*h2r-wi*h2i;          /* Here they are recombined to form the true */
        data[i2]=h1i+wr*h2i+wi*h2r;          /* transform of the original real data. */
        data[i3]=h1r-wr*h2r+wi*h2i;
        data[i4] = -h1i+wr*h2i+wi*h2r;
        wr=(wtemp=wr)*wpr-wi*wpi+wr;         /* The recurrence. */
        wi=wi*wpr+wtemp*wpi+wi;
    }
    if (isign == 1) {
        data[1] = (h1r=data[1])+data[2];     /* Squeeze the first and last data together to get */
        data[2] = h1r-data[2];               /* them all within the original array. */
    } else {
        data[1]=c1*((h1r=data[1])+data[2]);
        data[2]=c1*(h1r-data[2]);
        four1(data,n>>1,-1);                 /* This is the inverse transform for the case isign=-1. */
    }
}


Fast Sine and Cosine Transforms 

Among their other uses, the Fourier transforms of functions can be used to solve 
differential equations (see §19.4). The most common boundary conditions for the 
solutions are 1) they have the value zero at the boundaries, or 2) their derivatives 
are zero at the boundaries. In the first instance, the natural transform to use is the 
sine transform, given by 


$$F_k = \sum_{j=1}^{N-1} f_j \sin(\pi j k/N) \qquad \text{sine transform} \tag{12.3.7}$$

where $f_j$, $j = 0, \ldots, N-1$ is the data array, and $f_0 = 0$.

At first blush this appears to be simply the imaginary part of the discrete Fourier 
transform. However, the argument of the sine differs by a factor of two from the 
value that would make this so. The sine transform uses sines only as a complete set 
of functions in the interval from 0 to $2\pi$, and, as we shall see, the cosine transform
uses cosines only. By contrast, the normal FFT uses both sines and cosines, but only 
half as many of each. (See Figure 12.3.1.) 

The expression (12.3.7) can be “force-fit” into a form that allows its calculation 
via the FFT. The idea is to extend the given function rightward past its last tabulated 
value. We extend the data to twice their length in such a way as to make them an 
odd function about $j = N$, with $f_N = 0$,

$$f_{2N-j} = -f_j \qquad j = 0, \ldots, N-1 \tag{12.3.8}$$

Consider the FFT of this extended function:

$$F_k = \sum_{j=0}^{2N-1} f_j\, e^{2\pi i j k/(2N)} \tag{12.3.9}$$

The half of this sum from $j = N$ to $j = 2N-1$ can be rewritten with the
substitution $j' = 2N - j$

$$\sum_{j=N}^{2N-1} f_j\, e^{2\pi i j k/(2N)} = \sum_{j'=1}^{N} f_{2N-j'}\, e^{2\pi i (2N-j') k/(2N)} = -\sum_{j'=0}^{N-1} f_{j'}\, e^{-2\pi i j' k/(2N)} \tag{12.3.10}$$
Figure 12.3.1. Basis functions used by the Fourier transform (a), sine transform (b), and cosine transform
(c), are plotted. The first five basis functions are shown in each case. (For the Fourier transform, the real
and imaginary parts of the basis functions are both shown.) While some basis functions occur in more
than one transform, the basis sets are distinct. For example, the sine transform functions labeled (1), (3),
(5) are not present in the Fourier basis. Any of the three sets can expand any function in the interval
shown; however, the sine or cosine transform best expands functions matching the boundary conditions
of the respective basis functions, namely zero function values for sine, zero derivatives for cosine.


so that

$$F_k = \sum_{j=0}^{N-1} f_j \left[ e^{2\pi i j k/(2N)} - e^{-2\pi i j k/(2N)} \right] = 2i \sum_{j=0}^{N-1} f_j \sin(\pi j k/N) \tag{12.3.11}$$

Thus, up to a factor $2i$ we get the sine transform from the FFT of the extended function.

This method introduces a factor of two inefficiency into the computation by
extending the data. This inefficiency shows up in the FFT output, which has
zeros for the real part of every element of the transform. For a one-dimensional
problem, the factor of two may be bearable, especially in view of the simplicity
of the method. When we work with partial differential equations in two or three
dimensions, though, the factor becomes four or eight, so efforts to eliminate the
inefficiency are well rewarded.
From the original real data array $f_j$ we will construct an auxiliary array $y_j$ and
apply to it the routine realft. The output will then be used to construct the desired
transform. For the sine transform of data $f_j$, $j = 1, \ldots, N-1$, the auxiliary array is

$$\begin{aligned}
y_0 &= 0 \\
y_j &= \sin(j\pi/N)\,(f_j + f_{N-j}) + \tfrac{1}{2}(f_j - f_{N-j}) \qquad j = 1, \ldots, N-1
\end{aligned} \tag{12.3.12}$$

This array is of the same dimension as the original. Notice that the first term is
symmetric about $j = N/2$ and the second is antisymmetric. Consequently, when
realft is applied to $y_j$, the result has real parts $R_k$ and imaginary parts $I_k$ given by

$$\begin{aligned}
R_k &= \sum_{j=0}^{N-1} y_j \cos(2\pi j k/N) \\
&= \sum_{j=1}^{N-1} (f_j + f_{N-j}) \sin(j\pi/N) \cos(2\pi j k/N) \\
&= \sum_{j=0}^{N-1} 2 f_j \sin(j\pi/N) \cos(2\pi j k/N) \\
&= \sum_{j=0}^{N-1} f_j \left[ \sin\frac{(2k+1)j\pi}{N} - \sin\frac{(2k-1)j\pi}{N} \right] \\
&= F_{2k+1} - F_{2k-1}
\end{aligned} \tag{12.3.13}$$

$$\begin{aligned}
I_k &= \sum_{j=0}^{N-1} y_j \sin(2\pi j k/N) \\
&= \sum_{j=1}^{N-1} (f_j - f_{N-j})\, \tfrac{1}{2} \sin(2\pi j k/N) \\
&= \sum_{j=0}^{N-1} f_j \sin(2\pi j k/N) \\
&= F_{2k}
\end{aligned} \tag{12.3.14}$$

Therefore $F_k$ can be determined as follows:

$$F_{2k} = I_k \qquad F_{2k+1} = F_{2k-1} + R_k \qquad k = 0, \ldots, (N/2 - 1) \tag{12.3.15}$$

The even terms of $F_k$ are thus determined very directly. The odd terms require
a recursion, the starting point of which follows from setting $k = 0$ in equation
(12.3.15) and using $F_1 = -F_{-1}$:

$$F_1 = \tfrac{1}{2} R_0 \tag{12.3.16}$$


The implementing program is

#include <math.h>

void sinft(float y[], int n)
/* Calculates the sine transform of a set of n real-valued data points stored in array y[1..n].
The number n must be a power of 2. On exit y is replaced by its transform. This program,
without changes, also calculates the inverse sine transform, but in this case the output array
should be multiplied by 2/n. */
{
    void realft(float data[], unsigned long n, int isign);
    int j,n2=n+2;
    float sum,y1,y2;
    double theta,wi=0.0,wr=1.0,wpi,wpr,wtemp;   /* Double precision in the trigonometric recurrences. */

    theta=3.14159265358979/(double) n;          /* Initialize the recurrence. */
    wtemp=sin(0.5*theta);
    wpr = -2.0*wtemp*wtemp;
    wpi=sin(theta);
    y[1]=0.0;
    for (j=2;j<=(n>>1)+1;j++) {
        wr=(wtemp=wr)*wpr-wi*wpi+wr;            /* Calculate the sine for the auxiliary array. */
        wi=wi*wpr+wtemp*wpi+wi;                 /* The cosine is needed to continue the recurrence. */
        y1=wi*(y[j]+y[n2-j]);                   /* Construct the auxiliary array. */
        y2=0.5*(y[j]-y[n2-j]);
        y[j]=y1+y2;                             /* Terms j and N - j are related. */
        y[n2-j]=y1-y2;
    }
    realft(y,n,1);                              /* Transform the auxiliary array. */
    y[1]*=0.5;                                  /* Initialize the sum used for odd terms below. */
    sum=y[2]=0.0;
    for (j=1;j<=n-1;j+=2) {
        sum += y[j];
        y[j]=y[j+1];                            /* Even terms determined directly. */
        y[j+1]=sum;                             /* Odd terms determined by this running sum. */
    }
}


The sine transform, curiously, is its own inverse. If you apply it twice, you get the
original data, but multiplied by a factor of $N/2$.

The other common boundary condition for differential equations is that the
derivative of the function is zero at the boundary. In this case the natural transform
is the cosine transform. There are several possible ways of defining the transform.
Each can be thought of as resulting from a different way of extending a given array
to create an even array of double the length, and/or from whether the extended array
contains $2N - 1$, $2N$, or some other number of points. In practice, only two of the
numerous possibilities are useful so we will restrict ourselves to just these two.

The first form of the cosine transform uses $N + 1$ data points:


$$F_k = \tfrac{1}{2}\left[f_0 + (-1)^k f_N\right] + \sum_{j=1}^{N-1} f_j \cos(\pi j k/N) \tag{12.3.17}$$

It results from extending the given array to an even array about $j = N$, with

$$f_{2N-j} = f_j, \qquad j = 0, \ldots, N-1 \tag{12.3.18}$$


If you substitute this extended array into equation (12.3.9), and follow steps analogous 
to those leading up to equation (12.3.11), you will find that the Fourier transform is 
just twice the cosine transform (12.3.17). Another way of thinking about the formula
(12.3.17) is to notice that it is the Chebyshev Gauss-Lobatto quadrature formula (see 
§4.5), often used in Clenshaw-Curtis adaptive quadrature (see §5.9, equation 5.9.4). 

Once again the transform can be computed without the factor of two inefficiency. 
In this case the auxiliary function is 


$$y_j = \tfrac{1}{2}(f_j + f_{N-j}) - \sin(j\pi/N)\,(f_j - f_{N-j}) \qquad j = 0, \ldots, N-1 \tag{12.3.19}$$

Instead of equation (12.3.15), realft now gives

$$F_{2k} = R_k \qquad F_{2k+1} = F_{2k-1} + I_k \qquad k = 0, \ldots, (N/2 - 1) \tag{12.3.20}$$

The starting value for the recursion for odd $k$ in this case is

$$F_1 = \tfrac{1}{2}(f_0 - f_N) + \sum_{j=1}^{N-1} f_j \cos(j\pi/N) \tag{12.3.21}$$

This sum does not appear naturally among the R k and I k , and so we accumulate it 
during the generation of the array yj. 

Once again this transform is its own inverse, and so the following routine 
works for both directions of the transformation. Note that although this form of 
the cosine transform has N + 1 input and output values, it passes an array only 
of length N to realft. 


#include <math.h>
#define PI 3.141592653589793

void cosft1(float y[], int n)
/* Calculates the cosine transform of a set y[1..n+1] of real-valued data points. The transformed
data replace the original data in array y. n must be a power of 2. This program, without
changes, also calculates the inverse cosine transform, but in this case the output array should
be multiplied by 2/n. */
{
    void realft(float data[], unsigned long n, int isign);
    int j,n2;
    float sum,y1,y2;
    double theta,wi=0.0,wpi,wpr,wr=1.0,wtemp;   /* Double precision for the trigonometric recurrences. */

    theta=PI/n;                                 /* Initialize the recurrence. */
    wtemp=sin(0.5*theta);
    wpr = -2.0*wtemp*wtemp;
    wpi=sin(theta);
    sum=0.5*(y[1]-y[n+1]);
    y[1]=0.5*(y[1]+y[n+1]);
    n2=n+2;
    for (j=2;j<=(n>>1);j++) {                   /* j=n/2+1 unnecessary since y[n/2+1] unchanged. */
        wr=(wtemp=wr)*wpr-wi*wpi+wr;            /* Carry out the recurrence. */
        wi=wi*wpr+wtemp*wpi+wi;
        y1=0.5*(y[j]+y[n2-j]);                  /* Calculate the auxiliary function. */
        y2=(y[j]-y[n2-j]);
        y[j]=y1-wi*y2;                          /* The values for j and N - j are related. */
        y[n2-j]=y1+wi*y2;
        sum += wr*y2;                           /* Carry along this sum for later use in unfolding the transform. */
    }
    realft(y,n,1);                              /* Calculate the transform of the auxiliary function. */
    y[n+1]=y[2];
    y[2]=sum;                                   /* sum is the value of F_1 in equation (12.3.21). */
    for (j=4;j<=n;j+=2) {
        sum += y[j];                            /* Equation (12.3.20). */
        y[j]=sum;
    }
}


The second important form of the cosine transform is defined by

$$F_k = \sum_{j=0}^{N-1} f_j \cos\frac{\pi k (j + \tfrac{1}{2})}{N} \tag{12.3.22}$$

with inverse

$$f_j = \frac{2}{N} {\sum_{k=0}^{N-1}}' F_k \cos\frac{\pi k (j + \tfrac{1}{2})}{N} \tag{12.3.23}$$

Here the prime on the summation symbol means that the term for $k = 0$ has a
coefficient of $\tfrac{1}{2}$ in front. This form arises by extending the given data, defined for
$j = 0, \ldots, N-1$, to $j = N, \ldots, 2N-1$ in such a way that it is even about the point
$N - \tfrac{1}{2}$ and periodic. (It is therefore also even about $j = -\tfrac{1}{2}$.) The form (12.3.23)
is related to Gauss-Chebyshev quadrature (see equation 4.5.19), to Chebyshev
approximation (§5.8, equation 5.8.7), and Clenshaw-Curtis quadrature (§5.9).

This form of the cosine transform is useful when solving differential equations 
on “staggered” grids, where the variables are centered midway between mesh points. 
It is also the standard form in the field of data compression and image processing. 

The auxiliary function used in this case is similar to equation (12.3.19):

$$y_j = \tfrac{1}{2}(f_j + f_{N-j-1}) + \sin\frac{\pi (j + \tfrac{1}{2})}{N}\,(f_j - f_{N-j-1}) \qquad j = 0, \ldots, N-1 \tag{12.3.24}$$

Carrying out the steps similar to those used to get from (12.3.12) to (12.3.15), we find

$$F_{2k} = \cos\frac{\pi k}{N} R_k - \sin\frac{\pi k}{N} I_k \tag{12.3.25}$$

$$F_{2k-1} = \sin\frac{\pi k}{N} R_k + \cos\frac{\pi k}{N} I_k + F_{2k+1} \tag{12.3.26}$$

Note that equation (12.3.26) gives

$$F_{N-1} = \tfrac{1}{2} R_{N/2} \tag{12.3.27}$$

Thus the even components are found directly from (12.3.25), while the odd
components are found by recursing (12.3.26) down from $k = N/2 - 1$, using (12.3.27)
to start.

Since the transform is not self-inverting, we have to reverse the above steps to 
find the inverse. Here is the routine: 
#include <math.h>
#define PI 3.141592653589793

void cosft2(float y[], int n, int isign)
/* Calculates the “staggered” cosine transform of a set y[1..n] of real-valued data points. The
transformed data replace the original data in array y. n must be a power of 2. Set isign to
+1 for a transform, and to -1 for an inverse transform. For an inverse transform, the output
array should be multiplied by 2/n. */
{
    void realft(float data[], unsigned long n, int isign);
    int i;
    float sum,sum1,y1,y2,ytemp;
    double theta,wi=0.0,wi1,wpi,wpr,wr=1.0,wr1,wtemp;   /* Double precision for the trigonometric recurrences. */

    theta=0.5*PI/n;                             /* Initialize the recurrences. */
    wr1=cos(theta);
    wi1=sin(theta);
    wpr = -2.0*wi1*wi1;
    wpi=sin(2.0*theta);
    if (isign == 1) {                           /* Forward transform. */
        for (i=1;i<=n/2;i++) {
            y1=0.5*(y[i]+y[n-i+1]);             /* Calculate the auxiliary function. */
            y2=wi1*(y[i]-y[n-i+1]);
            y[i]=y1+y2;
            y[n-i+1]=y1-y2;
            wr1=(wtemp=wr1)*wpr-wi1*wpi+wr1;    /* Carry out the recurrence. */
            wi1=wi1*wpr+wtemp*wpi+wi1;
        }
        realft(y,n,1);                          /* Transform the auxiliary function. */
        for (i=3;i<=n;i+=2) {                   /* Even terms. */
            wr=(wtemp=wr)*wpr-wi*wpi+wr;
            wi=wi*wpr+wtemp*wpi+wi;
            y1=y[i]*wr-y[i+1]*wi;
            y2=y[i+1]*wr+y[i]*wi;
            y[i]=y1;
            y[i+1]=y2;
        }
        sum=0.5*y[2];                           /* Initialize recurrence for odd terms with (1/2) R_{N/2}. */
        for (i=n;i>=2;i-=2) {                   /* Carry out recurrence for odd terms. */
            sum1=sum;
            sum += y[i];
            y[i]=sum1;
        }
    } else if (isign == -1) {                   /* Inverse transform. */
        ytemp=y[n];
        for (i=n;i>=4;i-=2) y[i]=y[i-2]-y[i];   /* Form difference of odd terms. */
        y[2]=2.0*ytemp;
        for (i=3;i<=n;i+=2) {                   /* Calculate R_k and I_k. */
            wr=(wtemp=wr)*wpr-wi*wpi+wr;
            wi=wi*wpr+wtemp*wpi+wi;
            y1=y[i]*wr+y[i+1]*wi;
            y2=y[i+1]*wr-y[i]*wi;
            y[i]=y1;
            y[i+1]=y2;
        }
        realft(y,n,-1);
        for (i=1;i<=n/2;i++) {                  /* Invert auxiliary array. */
            y1=y[i]+y[n-i+1];
            y2=(0.5/wi1)*(y[i]-y[n-i+1]);
            y[i]=0.5*(y1+y2);
            y[n-i+1]=0.5*(y1-y2);
            wr1=(wtemp=wr1)*wpr-wi1*wpi+wr1;
            wi1=wi1*wpr+wtemp*wpi+wi1;
        }
    }
}


An alternative way of implementing this algorithm is to form an auxiliary 
function by copying the even elements of fj into the first N/2 locations, and the 
odd elements into the next N/2 elements in reverse order. However, it is not easy 
to implement the alternative algorithm without a temporary storage array and we 
prefer the above in-place algorithm. 

Finally, we mention that there exist fast cosine transforms for small N that do 
not rely on an auxiliary function or use an FFT routine. Instead, they carry out the 
transform directly, often coded in hardware for fixed $N$ of small dimension [1].


CITED REFERENCES AND FURTHER READING: 

Brigham, E.O. 1974, The Fast Fourier Transform (Englewood Cliffs, NJ: Prentice-Hall), §10-10.

Sorensen, H.V., Jones, D.L., Heideman, M.T., and Burrus, C.S. 1987, IEEE Transactions on
Acoustics, Speech, and Signal Processing, vol. ASSP-35, pp. 849-863.

Hou, H.S. 1987, IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. ASSP-35,
pp. 1455-1461 [see for additional references].

Hockney, R.W. 1971, in Methods in Computational Physics, vol. 9 (New York: Academic Press).

Temperton, C. 1980, Journal of Computational Physics, vol. 34, pp. 314-329.

Clarke, R.J. 1985, Transform Coding of Images (Reading, MA: Addison-Wesley).

Gonzalez, R.C., and Wintz, P. 1987, Digital Image Processing (Reading, MA: Addison-Wesley).

Chen, W., Smith, C.H., and Fralick, S.C. 1977, IEEE Transactions on Communications, vol. COM-25,
pp. 1004-1009. [1]


12.4 FFT in Two or More Dimensions 


Given a complex function $h(k_1, k_2)$ defined over the two-dimensional grid
$0 \le k_1 \le N_1 - 1$, $0 \le k_2 \le N_2 - 1$, we can define its two-dimensional discrete
Fourier transform as a complex function $H(n_1, n_2)$, defined over the same grid,

$$H(n_1, n_2) = \sum_{k_2=0}^{N_2-1} \sum_{k_1=0}^{N_1-1} \exp(2\pi i k_2 n_2/N_2)\, \exp(2\pi i k_1 n_1/N_1)\, h(k_1, k_2) \tag{12.4.1}$$

By pulling the “subscripts 2” exponential outside of the sum over $k_1$, or by reversing
the order of summation and pulling the “subscripts 1” outside of the sum over $k_2$,
we can see instantly that the two-dimensional FFT can be computed by taking
one-dimensional FFTs sequentially on each index of the original function. Symbolically,

$$\begin{aligned}
H(n_1, n_2) &= \text{FFT-on-index-1}\,(\text{FFT-on-index-2}\,[h(k_1, k_2)]) \\
&= \text{FFT-on-index-2}\,(\text{FFT-on-index-1}\,[h(k_1, k_2)])
\end{aligned} \tag{12.4.2}$$
For this to be practical, of course, both $N_1$ and $N_2$ should be some efficient length
for an FFT, usually a power of 2. Programming a two-dimensional FFT, using 
(12.4.2) with a one-dimensional FFT routine, is a bit clumsier than it seems at first. 
Because the one-dimensional routine requires that its input be in consecutive order 
as a one-dimensional complex array, you find that you are endlessly copying things 
out of the multidimensional input array and then copying things back into it. This 
is not recommended technique. Rather, you should use a multidimensional FFT 
routine, such as the one we give below. 

The generalization of (12.4.1) to more than two dimensions, say to $L$
dimensions, is evidently

$$H(n_1, \ldots, n_L) = \sum_{k_L=0}^{N_L-1} \cdots \sum_{k_1=0}^{N_1-1} \exp(2\pi i k_L n_L/N_L) \times \cdots \times \exp(2\pi i k_1 n_1/N_1)\, h(k_1, \ldots, k_L) \tag{12.4.3}$$

where $n_1$ and $k_1$ range from 0 to $N_1 - 1$, ..., $n_L$ and $k_L$ range from 0 to $N_L - 1$.
How many calls to a one-dimensional FFT are in (12.4.3)? Quite a few! For each
value of $k_1, k_2, \ldots, k_{L-1}$ you FFT to transform the $L$ index. Then for each value of
$k_1, k_2, \ldots, k_{L-2}$ and $n_L$ you FFT to transform the $L - 1$ index. And so on. It is
best to rely on someone else having done the bookkeeping for once and for all.

The inverse transforms of (12.4.1) or (12.4.3) are just what you would expect 
them to be: Change the $i$'s in the exponentials to $-i$'s, and put an overall
factor of $1/(N_1 \times \cdots \times N_L)$ in front of the whole thing. Most other features
of multidimensional FFTs are also analogous to features already discussed in the 
one-dimensional case: 

• Frequencies are arranged in wrap-around order in the transform, but now 
for each separate dimension. 

• The input data are also treated as if they were wrapped around. If they are 
discontinuous across this periodic identification (in any dimension) then 
the spectrum will have some excess power at high frequencies because 
of the discontinuity. The fix, if you care, is to remove multidimensional 
linear trends. 

• If you are doing spatial filtering and are worried about wrap-around effects, 
then you need to zero-pad all around the border of the multidimensional 
array. However, be sure to notice how costly zero-padding is in multidimensional
transforms. If you use too thick a zero-pad, you are going to
waste a lot of storage, especially in 3 or more dimensions! 

• Aliasing occurs as always if sufficient bandwidth limiting does not exist 
along one or more of the dimensions of the transform. 

The routine fourn that we furnish herewith is a descendant of one written by N.
M. Brenner. It requires as input (i) a scalar, telling the number of dimensions, e.g., 
2; (ii) a vector, telling the length of the array in each dimension, e.g., (32,64). Note 
that these lengths must all be powers of 2, and are the numbers of complex values 
in each direction; (iii) the usual scalar equal to ±1 indicating whether you want the 
transform or its inverse; and, finally (iv) the array of data. 

Figure 12.4.1. Storage arrangement of frequencies in the output of a two-dimensional FFT.
(The diagram shows the array running from data[1] to data[$2N_1N_2$] in rows of $2N_2$ float numbers.)
The input data is a two-dimensional $N_1 \times N_2$ array $h(t_1, t_2)$ (stored by rows of complex numbers).
The output is also stored by complex rows. Each row corresponds to a particular value of $f_1$, as shown
in the figure. Within each row, the arrangement of frequencies $f_2$ is exactly as shown in Figure 12.2.2.
$\Delta_1$ and $\Delta_2$ are the sampling intervals in the 1 and 2 directions, respectively. The total number of (real)
array elements is $2N_1N_2$. The program fourn can also do more than two dimensions, and the storage
arrangement generalizes in the obvious way.

A few words about the data array: fourn accesses it as a one-dimensional array
of real numbers, that is, data[1..$(2N_1N_2 \ldots N_L)$], of length equal to twice the
product of the lengths of the $L$ dimensions. It assumes that the array represents
an $L$-dimensional complex array, with individual components ordered as follows:
(i) each complex value occupies two sequential locations, real part followed by
imaginary; (ii) the first subscript changes least rapidly as one goes through the array;
the last subscript changes most rapidly (that is, “store by rows,” the C norm); (iii)
subscripts range from 1 to their maximum values ($N_1, N_2, \ldots, N_L$, respectively),
rather than from 0 to $N_1 - 1$, $N_2 - 1$, ..., $N_L - 1$. Almost all failures to get fourn
to work result from improper understanding of the above ordering of the data array,
so take care! (Figure 12.4.1 illustrates the format of the output array.)
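
Since the ordering is the usual source of trouble, a small helper that maps a set of subscripts to a position in data can be useful while debugging. The function below is a hypothetical convenience, not part of the book's library; it assumes the unit-offset conventions just described, with nn[1..ndim] and sub[1..ndim] both indexed from 1.

```c
#include <assert.h>

/* Position in data[] of the REAL part of the complex element with
   subscripts sub[1..ndim], each running from 1 to nn[d].  The imaginary
   part sits at the returned index plus one.  Index 0 of nn and sub is unused. */
unsigned long fourn_real_index(const unsigned long nn[], int ndim,
                               const unsigned long sub[])
{
    unsigned long offset = 0;
    for (int d = 1; d <= ndim; d++) {
        assert(sub[d] >= 1 && sub[d] <= nn[d]);
        offset = offset * nn[d] + (sub[d] - 1);   /* last subscript varies fastest */
    }
    return 2 * offset + 1;                        /* two floats per complex value, unit offset */
}
```

For a $32 \times 64$ array, element (1,1) starts at data[1], element (1,2) at data[3], element (2,1) at data[129], and the imaginary part of element (32,64) lands on data[4096], the last location.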

#include <math.h>
#define SWAP(a,b) tempr=(a);(a)=(b);(b)=tempr

void fourn(float data[], unsigned long nn[], int ndim, int isign)
/* Replaces data by its ndim-dimensional discrete Fourier transform, if isign is input as 1.
nn[1..ndim] is an integer array containing the lengths of each dimension (number of complex
values), which MUST all be powers of 2. data is a real array of length twice the product of
these lengths, in which the data are stored as in a multidimensional complex array: real and
imaginary parts of each element are in consecutive locations, and the rightmost index of the
array increases most rapidly as one proceeds along data. For a two-dimensional array, this is
equivalent to storing the array by rows. If isign is input as -1, data is replaced by its inverse
transform times the product of the lengths of all dimensions. */
{
    int idim;
    unsigned long i1,i2,i3,i2rev,i3rev,ip1,ip2,ip3,ifp1,ifp2;
    unsigned long ibit,k1,k2,n,nprev,nrem,ntot;
    float tempi,tempr;
    double theta,wi,wpi,wpr,wr,wtemp;        /* Double precision for trigonometric recurrences. */

    for (ntot=1,idim=1;idim<=ndim;idim++)    /* Compute total number of complex values. */
        ntot *= nn[idim];
    nprev=1;
    for (idim=ndim;idim>=1;idim--) {         /* Main loop over the dimensions. */
        n=nn[idim];
        nrem=ntot/(n*nprev);
        ip1=nprev << 1;
        ip2=ip1*n;
        ip3=ip2*nrem;
        i2rev=1;
        for (i2=1;i2<=ip2;i2+=ip1) {         /* This is the bit-reversal section of the routine. */
            if (i2 < i2rev) {
                for (i1=i2;i1<=i2+ip1-2;i1+=2) {
                    for (i3=i1;i3<=ip3;i3+=ip2) {
                        i3rev=i2rev+i3-i2;
                        SWAP(data[i3],data[i3rev]);
                        SWAP(data[i3+1],data[i3rev+1]);
                    }
                }
            }
            ibit=ip2 >> 1;
            while (ibit >= ip1 && i2rev > ibit) {
                i2rev -= ibit;
                ibit >>= 1;
            }
            i2rev += ibit;
        }
        ifp1=ip1;                            /* Here begins the Danielson-Lanczos section of the routine. */
        while (ifp1 < ip2) {
            ifp2=ifp1 << 1;
            theta=isign*6.28318530717959/(ifp2/ip1);   /* Initialize for the trig. recurrence. */
            wtemp=sin(0.5*theta);
            wpr = -2.0*wtemp*wtemp;
            wpi=sin(theta);
            wr=1.0;
            wi=0.0;
            for (i3=1;i3<=ifp1;i3+=ip1) {
                for (i1=i3;i1<=i3+ip1-2;i1+=2) {
                    for (i2=i1;i2<=ip3;i2+=ifp2) {
                        k1=i2;               /* Danielson-Lanczos formula: */
                        k2=k1+ifp1;
                        tempr=(float)wr*data[k2]-(float)wi*data[k2+1];
                        tempi=(float)wr*data[k2+1]+(float)wi*data[k2];
                        data[k2]=data[k1]-tempr;
                        data[k2+1]=data[k1+1]-tempi;
                        data[k1] += tempr;
                        data[k1+1] += tempi;
                    }
                }
                wr=(wtemp=wr)*wpr-wi*wpi+wr; /* Trigonometric recurrence. */
                wi=wi*wpr+wtemp*wpi+wi;
            }
            ifp1=ifp2;
        }
        nprev *= n;
    }
}




CITED REFERENCES AND FURTHER READING: 

Nussbaumer, H.J. 1982, Fast Fourier Transform and Convolution Algorithms (New York: Springer- 
Verlag). 


12.5 Fourier Transforms of Real Data in Two 
and Three Dimensions 

Two-dimensional FFTs are particularly important in the field of image processing.
An image is usually represented as a two-dimensional array of pixel intensities,
real (and usually positive) numbers. One commonly desires to filter high, or low, 
frequency spatial components from an image; or to convolve or deconvolve the 
image with some instrumental point spread function. Use of the FFT is by far the 
most efficient technique. 

In three dimensions, a common use of the FFT is to solve Poisson’s equation 
for a potential (e.g., electromagnetic or gravitational) on a three-dimensional lattice 
that represents the discretization of three-dimensional space. Here the source terms 
(mass or charge distribution) and the desired potentials are also real. In two and 
three dimensions, with large arrays, memory is often at a premium. It is therefore 
important to perform the FFTs, insofar as possible, on the data “in place.” We 
want a routine with functionality similar to the multidimensional FFT routine fourn
(§12.4), but which operates on real, not complex, input data. We give such a
routine in this section. The development is analogous to that of §12.3 leading to the
one-dimensional routine realft. (You might wish to review that material at this
point, particularly equation 12.3.5.)

It is convenient to think of the independent variables $n_1, \ldots, n_L$ in equation
(12.4.3) as representing an L-dimensional vector $\vec{n}$ in wave-number space, with
values on the lattice of integers. The transform $H(n_1, \ldots, n_L)$ is then denoted
$H(\vec{n})$.

It is easy to see that the transform $H(\vec{n})$ is periodic in each of its L dimensions.
Specifically, if $\vec{P}_1, \vec{P}_2, \vec{P}_3, \ldots$ denote the vectors $(N_1, 0, 0, \ldots)$,
$(0, N_2, 0, \ldots)$, $(0, 0, N_3, \ldots)$, and so forth, then

$$H(\vec{n} \pm \vec{P}_j) = H(\vec{n}) \qquad j = 1, \ldots, L \tag{12.5.1}$$

Equation (12.5.1) holds for any input data, real or complex. When the data is real,
we have the additional symmetry

$$H(-\vec{n}) = H(\vec{n})^* \tag{12.5.2}$$

Equations (12.5.1) and (12.5.2) imply that the full transform can be trivially obtained
from the subset of lattice values $\vec{n}$ that have

$$\begin{aligned}
0 &\le n_1 \le N_1 - 1 \\
0 &\le n_2 \le N_2 - 1 \\
&\;\;\vdots \\
0 &\le n_L \le \frac{N_L}{2}
\end{aligned} \tag{12.5.3}$$

In fact, this set of values is overcomplete, because there are additional symmetry
relations among the transform values that have $n_L = 0$ and $n_L = N_L/2$. However
these symmetries are complicated and their use becomes extremely confusing.
Therefore, we will compute our FFT on the lattice subset of equation (12.5.3), 
even though this requires a small amount of extra storage for the answer, i.e., the 
transform is not quite “in place.” (Although an in-place transform is in fact possible, 
we have found it virtually impossible to explain to any user how to unscramble its 
output, i.e., where to find the real and imaginary components of the transform at 
some particular frequency!) 
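
As a small illustration of how the periodicity (12.5.1) and the reality symmetry (12.5.2) work together, the sketch below maps a lattice index n in the range 0..N-1 to the in-range index at which the value for -n is found. The function name neg_index is hypothetical and is not one of the book's routines; the transform value at the returned index is the complex conjugate of the value at n when the input data are real.

```c
#include <assert.h>

/* Hypothetical helper (not from the book): index in 0..N-1 that is
   equivalent to -n under the periodicity of equation (12.5.1). */
unsigned long neg_index(unsigned long n, unsigned long N)
{
    assert(N > 0 && n < N);
    return (N - n) % N;   /* n = 0 maps to itself; n = N/2 maps to itself */
}
```

Note that the indices 0 and N/2 are their own partners, which is the origin of the extra symmetry relations mentioned above.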

We will implement the multidimensional real Fourier transform for the three
dimensional case L = 3, with the input data stored as a real, three-dimensional array
data[1..nn1][1..nn2][1..nn3]. This scheme will allow two-dimensional data
to be processed with effectively no loss of efficiency simply by choosing nn1 = 1.
(Note that it must be the first dimension that is set to 1.) The output spectrum comes
back packaged, logically at least, as a complex, three-dimensional array that we can
call SPEC[1..nn1][1..nn2][1..nn3/2+1] (cf. equation 12.5.3). In the first two
of its three dimensions, the respective frequency values $f_1$ or $f_2$ are stored in
wrap-around order, that is with zero frequency in the first index value, the smallest positive
frequency in the second index value, the smallest negative frequency in the last index
value, and so on (cf. the discussion leading up to routines four1 and fourn). The
third of the three dimensions returns only the positive half of the frequency spectrum.
Figure 12.5.1 shows the logical storage scheme. The returned portion of the complex
output spectrum is shown as the unshaded part of the lower figure.
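
The wrap-around ordering just described can be made concrete with a short sketch. The helper below is hypothetical (its name and signature are ours); it converts a unit-offset array index into a signed frequency index, under the assumption that the Nyquist entry at index n/2+1 is reported as the positive value +n/2.

```c
#include <assert.h>

/* Hypothetical helper (not from the book): signed frequency index for the
   unit-offset array index i (1..n) of a dimension stored in wrap-around
   order. Index n/2+1 is the Nyquist frequency, reported here as +n/2. */
long wrap_to_freq(unsigned long i, unsigned long n)
{
    unsigned long k;
    assert(n > 0 && i >= 1 && i <= n);
    k = i - 1;                                  /* zero-offset index */
    return (k <= n / 2) ? (long)k : (long)k - (long)n;
}
```

For n = 8 the indices 1, 2, ..., 8 correspond to the frequency indices 0, 1, 2, 3, 4, -3, -2, -1.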

The physical, as opposed to logical, packaging of the output spectrum is necessarily
a bit different from the logical packaging, because C does not have a convenient,
portable mechanism for equivalencing real and complex arrays. The subscript range
SPEC[1..nn1][1..nn2][1..nn3/2] is returned in the input array
data[1..nn1][1..nn2][1..nn3], with the correspondence

Re(SPEC[i1][i2][i3]) = data[i1][i2][2*i3-1]

Im(SPEC[i1][i2][i3]) = data[i1][i2][2*i3]

The remaining “plane” of values, SPEC[1..nn1][1..nn2][nn3/2+1], is returned
in the two-dimensional float array speq[1..nn1][1..2*nn2], with the
correspondence

Re(SPEC[i1][i2][nn3/2+1]) = speq[i1][2*i2-1]

Im(SPEC[i1][i2][nn3/2+1]) = speq[i1][2*i2]

Note that speq contains frequency components whose third component $f_3$ is at
the Nyquist critical frequency $\pm f_c$. In some applications these values will in fact
be ignored or set to zero, since they are intrinsically aliased between positive and
negative frequencies.
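
The two correspondences above can be wrapped in a single accessor, so that calling code need not remember which array holds which plane. This is only a sketch: spec_get is a hypothetical name, and it assumes unit-offset arrays laid out exactly as described in the text.

```c
#include <assert.h>

/* Hypothetical accessor (not from the book): fetch SPEC[i1][i2][i3] from the
   arrays produced by a forward transform, using the correspondence given in
   the text. All indices are unit-offset; i3 runs from 1 to nn3/2+1. */
void spec_get(float ***data, float **speq, unsigned long nn3,
              unsigned long i1, unsigned long i2, unsigned long i3,
              float *re, float *im)
{
    assert(i3 >= 1 && i3 <= nn3 / 2 + 1);
    if (i3 <= nn3 / 2) {               /* stored in place in data */
        *re = data[i1][i2][2 * i3 - 1];
        *im = data[i1][i2][2 * i3];
    } else {                           /* Nyquist plane, stored in speq */
        *re = speq[i1][2 * i2 - 1];
        *im = speq[i1][2 * i2];
    }
}
```

Values at negative $f_3$ are not stored; they would be obtained from the stored ones through the symmetry (12.5.2).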

With this much introduction, the implementing procedure, called rlft3, is
something of an anticlimax. Look in the innermost loop in the procedure, and you
will see equation (12.3.5) implemented on the last transform index. The case of
i3=1 is coded separately, to account for the fact that speq is to be filled instead of
overwriting the input array of data. The three enclosing for loops (indices i2, i3,
and i1, from inside to outside) could in fact be done in any order; their actions all
commute. We chose the order shown because of the following considerations: (i) i3
should not be the inner loop, because if it is, then the recurrence relations on wr and
wi become burdensome. (ii) On virtual-memory machines, i1 should be the outer
loop, because (with C order of array storage) this results in the array data, which
might be very large, being accessed in block sequential order.

Figure 12.5.1. Input and output data arrangement for rlft3. All arrays shown are presumed
to have a first (leftmost) dimension of range [1..nn1], coming out of the page. The input data
array is a real, three-dimensional array data[1..nn1][1..nn2][1..nn3]. (For two-dimensional
data, one sets nn1 = 1.) The output data can be viewed as a single complex array with dimensions
[1..nn1][1..nn2][1..nn3/2+1] (cf. equation 12.5.3), corresponding to the frequency components
$f_1$ and $f_2$ being stored in wrap-around order, but only positive $f_3$ values being stored (others being
obtainable by symmetry). The output data is actually returned mostly in the input array data, but partly
stored in the real array speq[1..nn1][1..2*nn2]. See text for details.

Note that the work done in rlft3 is quite (logarithmically) small, compared to
the associated complex FFT, fourn. Since C does not have a convenient complex
type, the operations are carried out explicitly below in terms of real and imaginary
parts. The routine rlft3 is based on an earlier routine by G.B. Rybicki.

#include <math.h>

void rlft3(float ***data, float **speq, unsigned long nn1, unsigned long nn2,
    unsigned long nn3, int isign)
/* Given a three-dimensional real array data[1..nn1][1..nn2][1..nn3] (where nn1 = 1 for
the case of a logically two-dimensional array), this routine returns (for isign=1) the complex
fast Fourier transform as two complex arrays: On output, data contains the zero and positive
frequency values of the third frequency component, while speq[1..nn1][1..2*nn2] contains
the Nyquist critical frequency values of the third frequency component. First (and second)
frequency components are stored for zero, positive, and negative frequencies, in standard
wrap-around order. See text for description of how complex values are arranged. For isign=-1, the
inverse transform (times nn1*nn2*nn3/2 as a constant multiplicative factor) is performed,
with output data (viewed as a real array) deriving from input data (viewed as complex) and
speq. For inverse transforms on data not generated first by a forward transform, make sure
the complex input data array satisfies property (12.5.2). The dimensions nn1, nn2, nn3 must
always be integer powers of 2. */
{
    void fourn(float data[], unsigned long nn[], int ndim, int isign);
    void nrerror(char error_text[]);
    unsigned long i1,i2,i3,j1,j2,j3,nn[4],ii3;
    double theta,wi,wpi,wpr,wr,wtemp;
    float c1,c2,h1r,h1i,h2r,h2i;

    if (1+&data[nn1][nn2][nn3]-&data[1][1][1] != nn1*nn2*nn3)
        nrerror("rlft3: problem with dimensions or contiguity of data array\n");
    c1=0.5;
    c2 = -0.5*isign;
    theta=isign*(6.28318530717959/nn3);
    wtemp=sin(0.5*theta);
    wpr = -2.0*wtemp*wtemp;
    wpi=sin(theta);
    nn[1]=nn1;
    nn[2]=nn2;
    nn[3]=nn3 >> 1;
    if (isign == 1) {                           /* Case of forward transform. */
        fourn(&data[1][1][1]-1,nn,3,isign);     /* Here is where most all of the
                                                   compute time is spent. */
        for (i1=1;i1<=nn1;i1++)
            for (i2=1,j2=0;i2<=nn2;i2++) {      /* Extend data periodically into speq. */
                speq[i1][++j2]=data[i1][i2][1];
                speq[i1][++j2]=data[i1][i2][2];
            }
    }
    for (i1=1;i1<=nn1;i1++) {
        j1=(i1 != 1 ? nn1-i1+2 : 1);
        /* Zero frequency is its own reflection, otherwise locate corresponding
           negative frequency in wrap-around order. */
        wr=1.0;                                 /* Initialize trigonometric recurrence. */
        wi=0.0;
        for (ii3=1,i3=1;i3<=(nn3>>2)+1;i3++,ii3+=2) {
            for (i2=1;i2<=nn2;i2++) {
                if (i3 == 1) {                  /* Equation (12.3.5). */
                    j2=(i2 != 1 ? ((nn2-i2)<<1)+3 : 1);
                    h1r=c1*(data[i1][i2][1]+speq[j1][j2]);
                    h1i=c1*(data[i1][i2][2]-speq[j1][j2+1]);
                    h2i=c2*(data[i1][i2][1]-speq[j1][j2]);
                    h2r= -c2*(data[i1][i2][2]+speq[j1][j2+1]);
                    data[i1][i2][1]=h1r+h2r;
                    data[i1][i2][2]=h1i+h2i;
                    speq[j1][j2]=h1r-h2r;
                    speq[j1][j2+1]=h2i-h1i;
                } else {
                    j2=(i2 != 1 ? nn2-i2+2 : 1);
                    j3=nn3+3-(i3<<1);
                    h1r=c1*(data[i1][i2][ii3]+data[j1][j2][j3]);
                    h1i=c1*(data[i1][i2][ii3+1]-data[j1][j2][j3+1]);
                    h2i=c2*(data[i1][i2][ii3]-data[j1][j2][j3]);
                    h2r= -c2*(data[i1][i2][ii3+1]+data[j1][j2][j3+1]);
                    data[i1][i2][ii3]=h1r+wr*h2r-wi*h2i;
                    data[i1][i2][ii3+1]=h1i+wr*h2i+wi*h2r;
                    data[j1][j2][j3]=h1r-wr*h2r+wi*h2i;
                    data[j1][j2][j3+1]= -h1i+wr*h2i+wi*h2r;
                }
            }
            wr=(wtemp=wr)*wpr-wi*wpi+wr;        /* Do the recurrence. */
            wi=wi*wpr+wtemp*wpi+wi;
        }
    }
    if (isign == -1)                            /* Case of reverse transform. */
        fourn(&data[1][1][1]-1,nn,3,isign);
}

Figure 12.5.2. (a) A two-dimensional image with intensities either purely black or purely white. (b) The
same image, after it has been low-pass filtered using rlft3. Regions with fine-scale features become gray.


We now give some fragments from notional calling programs, to clarify the use
of rlft3 for two- and three-dimensional data. Note again that the routine does not
actually distinguish between two and three dimensions; two is treated like three, but
with the first dimension having length 1. Since the first dimension is the outer loop,
virtually no inefficiency is introduced.

The first program fragment FFTs a two-dimensional data array, allows for some
processing on it, e.g., filtering, and then takes the inverse transform. Figure 12.5.2
shows an example of the use of this kind of code: A sharp image becomes blurry
when its high-frequency spatial components are suppressed by the factor (here)
$\max(1 - 6f^2/f_c^2,\ 0)$. The second program example illustrates a three-dimensional
transform, where the three dimensions have different lengths. The third program
example is an example of convolution, as it might occur in a program to compute the
potential generated by a three-dimensional distribution of sources.

#include <stdlib.h>
#include "nrutil.h"
#define N2 256
#define N3 256      /* Note that the first component must be set to 1. */

int main(void)      /* example1 */
/* This fragment shows how one might filter a 256 by 256 digital image. */
{
    void rlft3(float ***data, float **speq, unsigned long nn1,
        unsigned long nn2, unsigned long nn3, int isign);
    float ***data, **speq;

    data=f3tensor(1,1,1,N2,1,N3);
    speq=matrix(1,1,1,2*N2);
    /* ... */       /* Here the image would be loaded into data. */
    rlft3(data,speq,1,N2,N3,1);
    /* ... */       /* Here the arrays data and speq would be multiplied by a
                       suitable filter function (of frequency). */
    rlft3(data,speq,1,N2,N3,-1);
    /* ... */       /* Here the filtered image would be unloaded from data. */
    free_matrix(speq,1,1,1,2*N2);
    free_f3tensor(data,1,1,1,N2,1,N3);
    return 0;
}
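
The filtering step left as a comment in the fragment above might use a gain function along the following lines. This is a sketch under stated assumptions, not the code used to produce Figure 12.5.2: the name lowpass_gain is hypothetical, and we assume that $f^2$ in the factor $\max(1 - 6f^2/f_c^2, 0)$ means $f_2^2 + f_3^2$, with each component measured in units of its own Nyquist frequency.

```c
#include <assert.h>

/* Hypothetical helper (not from the book): gain max(1 - 6 f^2/fc^2, 0) for
   signed frequency indices k2, k3 on an n2 by n3 grid. Assumption:
   f^2 = f2^2 + f3^2, each component in units of its Nyquist frequency. */
float lowpass_gain(long k2, unsigned long n2, long k3, unsigned long n3)
{
    double x2, x3, g;
    assert(n2 >= 2 && n3 >= 2);
    x2 = 2.0 * (double)k2 / (double)n2;   /* f2/fc, between -1 and 1 */
    x3 = 2.0 * (double)k3 / (double)n3;   /* f3/fc, between 0 and 1  */
    g = 1.0 - 6.0 * (x2 * x2 + x3 * x3);
    return (float)(g > 0.0 ? g : 0.0);
}
```

In the fragment, one would multiply both data[1][i2][2*i3-1] and data[1][i2][2*i3] by the gain evaluated at k3 = i3-1 (with k2 obtained from i2 in wrap-around order), and both speq[1][2*i2-1] and speq[1][2*i2] by the gain evaluated at k3 = N3/2.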

#define N1 32
#define N2 64
#define N3 16

int main(void)      /* example2 */
/* This fragment shows how one might FFT a real three-dimensional array of size 32 by 64 by 16. */
{
    void rlft3(float ***data, float **speq, unsigned long nn1,
        unsigned long nn2, unsigned long nn3, int isign);
    int j;
    float ***data,**speq;

    data=f3tensor(1,N1,1,N2,1,N3);
    speq=matrix(1,N1,1,2*N2);
    /* ... */       /* Here load data. */
    rlft3(data,speq,N1,N2,N3,1);
    /* ... */       /* Here unload data and speq. */
    free_matrix(speq,1,N1,1,2*N2);
    free_f3tensor(data,1,N1,1,N2,1,N3);
    return 0;
}

#define N 32

int main(void)      /* example3 */
/* This fragment shows how one might convolve two real, three-dimensional arrays of size 32 by
32 by 32, replacing the first array by the result. */
{
    void rlft3(float ***data, float **speq, unsigned long nn1,
        unsigned long nn2, unsigned long nn3, int isign);
    int j;
    float fac,r,i,***data1,***data2,**speq1,**speq2,*sp1,*sp2;

    data1=f3tensor(1,N,1,N,1,N);
    data2=f3tensor(1,N,1,N,1,N);
    speq1=matrix(1,N,1,2*N);
    speq2=matrix(1,N,1,2*N);
    /* ... */
    rlft3(data1,speq1,N,N,N,1);         /* FFT both input arrays. */
    rlft3(data2,speq2,N,N,N,1);
    fac=2.0/(N*N*N);                    /* Factor needed to get normalized inverse. */
    sp1 = &data1[1][1][1];
    sp2 = &data2[1][1][1];
    for (j=1;j<=N*N*N/2;j++) {          /* Note how this can be made a single for-loop
                                           instead of three nested ones by using the
                                           pointers sp1 and sp2. */
        r = sp1[0]*sp2[0] - sp1[1]*sp2[1];
        i = sp1[0]*sp2[1] + sp1[1]*sp2[0];
        sp1[0] = fac*r;
        sp1[1] = fac*i;
        sp1 += 2;
        sp2 += 2;
    }
    sp1 = &speq1[1][1];
    sp2 = &speq2[1][1];
    for (j=1;j<=N*N;j++) {
        r = sp1[0]*sp2[0] - sp1[1]*sp2[1];
        i = sp1[0]*sp2[1] + sp1[1]*sp2[0];
        sp1[0] = fac*r;
        sp1[1] = fac*i;
        sp1 += 2;
        sp2 += 2;
    }
    rlft3(data1,speq1,N,N,N,-1);        /* Inverse FFT the product of the two FFTs. */
    /* ... */
    free_matrix(speq2,1,N,1,2*N);
    free_matrix(speq1,1,N,1,2*N);
    free_f3tensor(data2,1,N,1,N,1,N);
    free_f3tensor(data1,1,N,1,N,1,N);
    return 0;
}
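
The two nearly identical loops in the convolution fragment could be factored into one helper, sketched below. The name cmul_scale is hypothetical; like the fragment itself, it relies on the complex values being stored contiguously as (real, imaginary) pairs of floats.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not from the book): a[j] = fac * a[j] * b[j] for
   npairs complex values stored as consecutive (real, imaginary) floats,
   addressed from the first element (zero-offset pointers). */
void cmul_scale(float *a, const float *b, size_t npairs, float fac)
{
    size_t j;
    assert(a != NULL && b != NULL);
    for (j = 0; j < npairs; j++, a += 2, b += 2) {
        float r = a[0] * b[0] - a[1] * b[1];   /* real part of product */
        float i = a[0] * b[1] + a[1] * b[0];   /* imaginary part       */
        a[0] = fac * r;
        a[1] = fac * i;
    }
}
```

With this helper, the two loops would become cmul_scale(&data1[1][1][1], &data2[1][1][1], N*N*N/2, fac) and cmul_scale(&speq1[1][1], &speq2[1][1], N*N, fac).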


To extend rlft3 to four dimensions, you simply add an additional (outer) nested
for loop in i0, analogous to the present i1. (Modifying the routine to do an arbitrary
number of dimensions, as in fourn, is a good programming exercise for the reader.)

CITED REFERENCES AND FURTHER READING: 

Brigham, E.O. 1974, The Fast Fourier Transform (Englewood Cliffs, NJ: Prentice-Hall).
Swarztrauber, P.N. 1986, Mathematics of Computation, vol. 47, pp. 323-346.

12.6 External Storage or Memory-Local FFTs


Sometime in your life, you might have to compute the Fourier transform of a really 
large data set, larger than the size of your computer’s physical memory. In such a case, 
the data will be stored on some external medium, such as magnetic or optical tape or disk. 
Needed is an algorithm that makes some manageable number of sequential passes through 
the external data, processing it on the fly and outputting intermediate results to other external 
media, which can be read on subsequent passes. 

In fact, an algorithm of just this description was developed by Singleton [1] very soon
after the discovery of the FFT. The algorithm requires four sequential storage devices, each
capable of holding half of the input data. The first half of the input data is initially on one
device, the second half on another.

Singleton’s algorithm is based on the observation that it is possible to bit-reverse $2^M$
values by the following sequence of operations: On the first pass, values are read alternately
from the two input devices, and written to a single output device (until it holds half the data),
and then to the other output device. On the second pass, the output devices become input
devices, and vice versa. Now, we copy two values from the first device, then two values
from the second, writing them (as before) first to fill one output device, then to fill a second.
Subsequent passes read 4, 8, etc., input values at a time. After completion of pass $M - 1$,
the data are in bit-reverse order.
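
The pass structure just described can be checked with a small in-memory simulation. In this sketch the two halves of an array stand in for the two input devices, and a scratch array of the same size stands in for the two output devices; the function name tape_bit_reverse and this in-memory arrangement are our assumptions, not part of Singleton's algorithm as implemented in fourfs.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical simulation (not from the book): bit-reverse the 2^m values in
   v using only sequential passes. The halves of each array play the role of
   two devices. Returns 0 on success, -1 if scratch storage is unavailable. */
int tape_bit_reverse(unsigned *v, unsigned m)
{
    size_t n = (size_t)1 << m, half = n >> 1, g, pos, o;
    unsigned *a = v, *b, *t, *scratch;

    assert(m >= 1);
    scratch = malloc(n * sizeof *scratch);
    if (scratch == NULL) return -1;
    b = scratch;
    for (g = 1; g < half; g <<= 1) {        /* passes 1 .. m-1 */
        o = 0;                              /* sequential output position */
        for (pos = 0; pos < half; pos += g) {
            memcpy(b + o, a + pos, g * sizeof *a);          /* g from device 1 */
            o += g;
            memcpy(b + o, a + half + pos, g * sizeof *a);   /* g from device 2 */
            o += g;
        }
        t = a; a = b; b = t;                /* outputs become inputs */
    }
    if (a != v) memcpy(v, a, n * sizeof *v);
    free(scratch);
    return 0;
}
```

Starting from the values 0, 1, ..., 7, two passes leave the array holding 0, 4, 2, 6, 1, 5, 3, 7, which is bit-reversed order for three bits.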

Singleton’s next observation is that it is possible to alternate the passes of essentially
this bit-reversal technique with passes that implement one stage of the Danielson-Lanczos
combination formula (12.2.3). The scheme, roughly, is this: One starts as before with half
the input data on one device, half on another. In the first pass, one complex value is read
from each input device. Two combinations are formed, and one is written to each of two
output devices. After this “computing” pass, the devices are rewound, and a “permutation”
pass is performed, where groups of values are read from the first input device and alternately
written to the first and second output devices; when the first input device is exhausted, the
second is similarly processed. This sequence of computing and permutation passes is repeated
$M - K - 1$ times, where $2^K$ is the size of internal buffer available to the program. The
second phase of the computation consists of a final K computation passes. What distinguishes
the second phase from the first is that, now, the permutations are local enough to do in place
during the computation. There are thus no separate permutation passes in the second phase.
In all, there are $2M - K - 2$ passes through the data.
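
The pass count quoted above follows from adding the two phases, as the following sketch spells out. The function name singleton_passes is hypothetical, and the count reflects the description in the text rather than a measurement of any particular implementation.

```c
#include <assert.h>

/* Hypothetical helper (not from the book): number of sequential passes
   described in the text for 2^M values and a buffer of size 2^K (M > K). */
unsigned singleton_passes(unsigned M, unsigned K)
{
    unsigned first_phase, second_phase;
    assert(M > K);
    first_phase = 2 * (M - K - 1);   /* computing + permutation, M-K-1 times */
    second_phase = K;                /* computing passes only */
    return first_phase + second_phase;   /* equals 2M - K - 2 */
}
```

For example, with M = 20 and K = 7 the first phase takes 24 passes and the second takes 7, for a total of 31.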

Here is an implementation of Singleton’s algorithm, based on [1]:


#include <stdio.h>
#include <math.h>
#include "nrutil.h"
#define KBF 128

void fourfs(FILE *file[5], unsigned long nn[], int ndim, int isign)
/* One- or multi-dimensional Fourier transform of a large data set stored on external media. On
input, ndim is the number of dimensions, and nn[1..ndim] contains the lengths of each
dimension (number of real and imaginary value pairs), which must be powers of two. file[1..4]
contains the stream pointers to 4 temporary files, each large enough to hold half of the data.
The four streams must be opened in the system's "binary" (as opposed to "text") mode. The
input data must be in C normal order, with its first half stored in file[1], its second
half in file[2], in native floating point form. KBF real numbers are processed per buffered
read or write. isign should be set to 1 for the Fourier transform, to -1 for its inverse. On
output, values in the array file may have been permuted; the first half of the result is stored in
file[3], the second half in file[4]. N.B.: For ndim > 1, the output is stored by columns,
i.e., not in C normal order; in other words, the output is the transpose of that which would have
been produced by routine fourn. */
{
    void fourew(FILE *file[5], int *na, int *nb, int *nc, int *nd);
    unsigned long j,j12,jk,k,kk,n=1,mm,kc=0,kd,ks,kr,nr,ns,nv;
    int cc,na,nb,nc,nd;
    float tempr,tempi,*afa,*afb,*afc;
    double wr,wi,wpr,wpi,wtemp,theta;
    static int mate[5] = {0,2,1,4,3};

    afa=vector(1,KBF);
    afb=vector(1,KBF);
    afc=vector(1,KBF);
    for (j=1;j<=ndim;j++) {
        n *= nn[j];
        if (nn[j] <= 1) nrerror("invalid float or wrong ndim in fourfs");
    }
    nv=1;
    jk=nn[nv];
    mm=n;
    ns=n/KBF;
    nr=ns >> 1;
    kd=KBF >> 1;
    ks=n;
    fourew(file,&na,&nb,&nc,&nd);

    /* The first phase of the transform starts here. */
    for (;;) {                              /* Start of the computing pass. */
        theta=isign*3.141592653589793/(n/mm);
        wtemp=sin(0.5*theta);
        wpr = -2.0*wtemp*wtemp;
        wpi=sin(theta);
        wr=1.0;
        wi=0.0;
        mm >>= 1;
        for (j12=1;j12<=2;j12++) {
            kr=0;
            do {
                cc=fread(&afa[1],sizeof(float),KBF,file[na]);
                if (cc != KBF) nrerror("read error in fourfs");
                cc=fread(&afb[1],sizeof(float),KBF,file[nb]);
                if (cc != KBF) nrerror("read error in fourfs");
                for (j=1;j<=KBF;j+=2) {
                    tempr=((float)wr)*afb[j]-((float)wi)*afb[j+1];
                    tempi=((float)wi)*afb[j]+((float)wr)*afb[j+1];
                    afb[j]=afa[j]-tempr;
                    afa[j] += tempr;
                    afb[j+1]=afa[j+1]-tempi;
                    afa[j+1] += tempi;
                }
                kc += kd;
                if (kc == mm) {
                    kc=0;
                    wr=(wtemp=wr)*wpr-wi*wpi+wr;
                    wi=wi*wpr+wtemp*wpi+wi;
                }
                cc=fwrite(&afa[1],sizeof(float),KBF,file[nc]);
                if (cc != KBF) nrerror("write error in fourfs");
                cc=fwrite(&afb[1],sizeof(float),KBF,file[nd]);
                if (cc != KBF) nrerror("write error in fourfs");
            } while (++kr < nr);
            if (j12 == 1 && ks != n && ks == KBF) {
                na=mate[na];
                nb=na;
            }
            if (nr == 0) break;
        }
        fourew(file,&na,&nb,&nc,&nd);       /* Start of the permutation pass. */
        jk >>= 1;
        while (jk == 1) {
            mm=n;
            jk=nn[++nv];
        }
        ks >>= 1;
        if (ks > KBF) {
            for (j12=1;j12<=2;j12++) {
                for (kr=1;kr<=ns;kr+=ks/KBF) {
                    for (k=1;k<=ks;k+=KBF) {
                        cc=fread(&afa[1],sizeof(float),KBF,file[na]);
                        if (cc != KBF) nrerror("read error in fourfs");
                        cc=fwrite(&afa[1],sizeof(float),KBF,file[nc]);
                        if (cc != KBF) nrerror("write error in fourfs");
                    }
                    nc=mate[nc];
                }
                na=mate[na];
            }
            fourew(file,&na,&nb,&nc,&nd);
        } else if (ks == KBF) nb=na;
        else break;
    }
    j=1;

    /* The second phase of the transform starts here. Now, the remaining permutations are
       sufficiently local to be done in place. */
    for (;;) {
        theta=isign*3.141592653589793/(n/mm);
        wtemp=sin(0.5*theta);
        wpr = -2.0*wtemp*wtemp;
        wpi=sin(theta);
        wr=1.0;
        wi=0.0;
        mm >>= 1;
        ks=kd;
        kd >>= 1;
        for (j12=1;j12<=2;j12++) {
            for (kr=1;kr<=ns;kr++) {
                cc=fread(&afc[1],sizeof(float),KBF,file[na]);
                if (cc != KBF) nrerror("read error in fourfs");
                kk=1;
                k=ks+1;
                for (;;) {
                    tempr=((float)wr)*afc[kk+ks]-((float)wi)*afc[kk+ks+1];
                    tempi=((float)wi)*afc[kk+ks]+((float)wr)*afc[kk+ks+1];
                    afa[j]=afc[kk]+tempr;
                    afb[j]=afc[kk]-tempr;
                    afa[++j]=afc[++kk]+tempi;
                    afb[j++]=afc[kk++]-tempi;
                    if (kk < k) continue;
                    kc += kd;
                    if (kc == mm) {
                        kc=0;
                        wr=(wtemp=wr)*wpr-wi*wpi+wr;
                        wi=wi*wpr+wtemp*wpi+wi;
                    }
                    kk += ks;
                    if (kk > KBF) break;
                    else k=kk+ks;
                }
                if (j > KBF) {
                    cc=fwrite(&afa[1],sizeof(float),KBF,file[nc]);
                    if (cc != KBF) nrerror("write error in fourfs");
                    cc=fwrite(&afb[1],sizeof(float),KBF,file[nd]);
                    if (cc != KBF) nrerror("write error in fourfs");
                    j=1;
                }
            }
            na=mate[na];
        }
        fourew(file,&na,&nb,&nc,&nd);
        jk >>= 1;
        if (jk > 1) continue;
        mm=n;
        do {
            if (nv < ndim) jk=nn[++nv];
            else {
                free_vector(afc,1,KBF);
                free_vector(afb,1,KBF);
                free_vector(afa,1,KBF);
                return;
            }
        } while (jk == 1);
    }
}


#include <stdio.h>
#define SWAP(a,b) ftemp=(a);(a)=(b);(b)=ftemp

void fourew(FILE *file[5], int *na, int *nb, int *nc, int *nd)
/* Utility used by fourfs. Rewinds and renumbers the four files. */
{
    int i;
    FILE *ftemp;

    for (i=1;i<=4;i++) rewind(file[i]);
    SWAP(file[2],file[4]);
    SWAP(file[1],file[3]);
    *na=3;
    *nb=4;
    *nc=1;
    *nd=2;
}


For one-dimensional data, Singleton’s algorithm produces output in exactly the same
order as a standard FFT (e.g., four1). For multidimensional data, the output is the transpose of
the conventional arrangement (e.g., the output of fourn). This peculiarity, which is intrinsic to
the method, is generally only a minor inconvenience. For convolutions, one simply computes
the component-by-component product of two transforms in their nonstandard arrangement,
and then does an inverse transform on the result. Note that, if the lengths of the different
dimensions are not all the same, then you must reverse the order of the values in nn[1..ndim]
(thus giving the transpose dimensions) before performing the inverse transform. Note also
that, just like fourn, performing a transform and then an inverse results in multiplying the
original data by the product of the lengths of all dimensions.
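
The reversal of nn[1..ndim] called for above is a one-loop operation. The sketch below uses a hypothetical helper name, reverse_dims, and follows the unit-offset convention of the routines in this chapter (element nn[0] is not touched).

```c
#include <assert.h>

/* Hypothetical helper (not from the book): reverse nn[1..ndim] in place,
   giving the transpose dimensions. nn[0] is unused and left alone. */
void reverse_dims(unsigned long nn[], int ndim)
{
    int lo, hi;
    assert(ndim >= 1);
    for (lo = 1, hi = ndim; lo < hi; lo++, hi--) {
        unsigned long t = nn[lo];
        nn[lo] = nn[hi];
        nn[hi] = t;
    }
}
```

A convolution would then call fourfs with isign=1 on each input, form the product, call reverse_dims, and finally call fourfs with isign=-1.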

We leave it as an exercise for the reader to figure out how to reorder fourfs’s output 
into normal order, taking additional passes through the externally stored data. We doubt that 
such reordering is ever really needed. 

You will likely want to modify fourfs to fit your particular application. For example, 
as written, KBF = $2^K$ plays the dual role of being the size of the internal buffers, and the
record size of the unformatted reads and writes. The latter role limits its size to that allowed 
by your machine’s I/O facility. It is a simple matter to perform multiple reads for a much 
larger KBF, thus reducing the number of passes by a few. 

Another modification of fourfs would be for the case where your virtual memory 
machine has sufficient address space, but not sufficient physical memory, to do an efficient 
FFT by the conventional algorithm (whose memory references are extremely nonlocal). In 
that case, you will need to replace the reads, writes, and rewinds by mappings of the arrays
afa, afb, and afc into your address space. In other words, these arrays are replaced by
references to a single data array, with offsets that get modified wherever fourfs performs an
I/O operation. The resulting algorithm will have its memory references local within blocks 
of size KBF. Execution speed is thereby sometimes increased enormously, albeit at the cost 
of requiring twice as much virtual memory as an in-place FFT. 

CITED REFERENCES AND FURTHER READING: 

Singleton, R.C. 1967, IEEE Transactions on Audio and Electroacoustics, vol. AU-15, pp. 91-97. [1]

Oppenheim, A.V., and Schafer, R.W. 1989, Discrete-Time Signal Processing (Englewood Cliffs,
NJ: Prentice-Hall), Chapter 9.

Applications


13.0 Introduction 


Fourier methods have revolutionized fields of science and engineering, from 
radio astronomy to medical imaging, from seismology to spectroscopy. In this 
chapter, we present some of the basic applications of Fourier and spectral methods 
that have made these revolutions possible. 

Say the word “Fourier” to a numericist, and the response, as if by Pavlovian 
conditioning, will likely be “FFT.” Indeed, the wide application of Fourier methods 
must be credited principally to the existence of the fast Fourier transform. Better 
mousetraps stand aside: If you speed up any nontrivial algorithm by a factor of a 
million or so, the world will beat a path towards finding useful applications for it. 
The most direct applications of the FFT are to the convolution or deconvolution of 
data (§13.1), correlation and autocorrelation (§13.2), optimal filtering (§13.3), power 
spectrum estimation (§13.4), and the computation of Fourier integrals (§13.9). 

As important as they are, however, FFT methods are not the be-all and end-all 
of spectral analysis. Section 13.5 is a brief introduction to the field of time-domain 
digital filters. In the spectral domain, one limitation of the FFT is that it always 
represents a function’s Fourier transform as a polynomial in $z = \exp(2\pi i f \Delta)$
(cf. equation 12.1.7). Sometimes, processes have spectra whose shapes are not 
well represented by this form. An alternative form, which allows the spectrum to 
have poles in z, is used in the techniques of linear prediction (§13.6) and maximum 
entropy spectral estimation (§13.7). 

Another significant limitation of all FFT methods is that they require the input 
data to be sampled at evenly spaced intervals. For irregularly or incompletely 
sampled data, other (albeit slower) methods are available, as discussed in §13.8. 

So-called wavelet methods inhabit a representation of function space that is 
neither in the temporal, nor in the spectral, domain, but rather something in-between. 
Section 13.10 is an introduction to this subject. Finally §13.11 is an excursion into
numerical use of the Fourier sampling theorem.

13.1 Convolution and Deconvolution Using the FFT


We have defined the convolution of two functions for the continuous case in 
equation (12.0.8), and have given the convolution theorem as equation (12.0.9). The 
theorem says that the Fourier transform of the convolution of two functions is equal 
to the product of their individual Fourier transforms. Now, we want to deal with 
the discrete case. We will mention first the context in which convolution is a useful 
procedure, and then discuss how to compute it efficiently using the FFT. 

The convolution of two functions r(t) and s(t), denoted r * s, is mathematically 
equal to their convolution in the opposite order, s * r. Nevertheless, in most 
applications the two functions have quite different meanings and characters. One of 
the functions, say s, is typically a signal or data stream, which goes on indefinitely 
in time (or in whatever the appropriate independent variable may be). The other 
function r is a “response function,” typically a peaked function that falls to zero in 
both directions from its maximum. The effect of convolution is to smear the signal 
s(t) in time according to the recipe provided by the response function r(t), as shown 
in Figure 13.1.1. In particular, a spike or delta-function of unit area in s which occurs
at some time $t_0$ is supposed to be smeared into the shape of the response function
itself, but translated from time 0 to time $t_0$ as $r(t - t_0)$.

In the discrete case, the signal s(t) is represented by its sampled values at equal
time intervals $s_j$. The response function is also a discrete set of numbers $r_k$, with the
following interpretation: $r_0$ tells what multiple of the input signal in one channel (one
particular value of j) is copied into the identical output channel (same value of j);
$r_1$ tells what multiple of input signal in channel j is additionally copied into output
channel $j + 1$; $r_{-1}$ tells the multiple that is copied into channel $j - 1$; and so on for
both positive and negative values of k in $r_k$. Figure 13.1.2 illustrates the situation.

Example: a response function with $r_0 = 1$ and all other $r_k$’s equal to zero
is just the identity filter: convolution of a signal with this response function gives
identically the signal. Another example is the response function with $r_{14} = 1.5$ and
all other $r_k$’s equal to zero. This produces convolved output that is the input signal
multiplied by 1.5 and delayed by 14 sample intervals.

Evidently, we have just described in words the following definition of discrete
convolution with a response function of finite duration M:

$$(r * s)_j \equiv \sum_{k=-M/2+1}^{M/2} s_{j-k}\, r_k \tag{13.1.1}$$

If a discrete response function is nonzero only in some range $-M/2 < k \le M/2$,
where M is a sufficiently large even integer, then the response function is called a
finite impulse response (FIR), and its duration is M. (Notice that we are defining M
as the number of nonzero values of $r_k$; these values span a time interval of $M - 1$
sampling times.) In most practical circumstances the case of finite M is the case of
interest, either because the response really has a finite duration, or because we choose 
to truncate it at some point and approximate it by a finite-duration response function. 

The discrete convolution theorem is this: If a signal Sj is periodic with period 
N, so that it is completely determined by the N values Sq> • • •, sjv-i» then its 
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Figure 13.1.1. Example of the convolution of two functions. A signal s(t) is convolved with a 
response function r(t). Since the response function is broader than some features in the original signal, 
these are “washed out” in the convolution. In the absence of any additional noise, the process can be 
reversed by deconvolution. 
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discrete convolution with a response function of finite duration N is a member of 
the discrete Fourier transform pair, 

N/2 

Y, 4i-k n <*=► S n R n (13.1.2) 

k=-N/ 2+1 

Here $S_n$, $(n = 0, \ldots, N - 1)$ is the discrete Fourier transform of the values 
$s_j$, $(j = 0, \ldots, N - 1)$, while $R_n$, $(n = 0, \ldots, N - 1)$ is the discrete Fourier 
transform of the values $r_k$, $(k = 0, \ldots, N - 1)$. These values of $r_k$ are the same 
ones as for the range $k = -N/2 + 1, \ldots, N/2$, but in wrap-around order, exactly 
as was described at the end of §12.2. 
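
The wrap-around storage and the periodic sum on the left of (13.1.2) can be sketched as follows. The names `wrap_index` and `cyclic_convolve` are hypothetical, and the arrays are 0-based, unlike the unit-offset arrays used by the routines in this book.

```c
#include <stddef.h>

/* Map a signed response index k in -N/2+1 .. N/2 to its slot 0..N-1
 * in wrap-around order: k >= 0 stays put, k < 0 goes to N + k. */
size_t wrap_index(long k, size_t N)
{
    return (k >= 0) ? (size_t)k : (size_t)((long)N + k);
}

/* Cyclic convolution out_j = sum_k s_{(j-k) mod N} r_k, where r_wrapped
 * already holds the response in wrap-around order. */
void cyclic_convolve(const double *s, const double *r_wrapped,
                     size_t N, double *out)
{
    for (size_t j = 0; j < N; j++) {
        double sum = 0.0;
        for (size_t k = 0; k < N; k++) {
            size_t idx = (j + N - k) % N;   /* (j - k) mod N, kept non-negative */
            sum += s[idx] * r_wrapped[k];
        }
        out[j] = sum;
    }
}
```

Because the signal is treated as periodic, a response with $r_{-1} \ne 0$ makes the last output pick up $s_0$; this is the wrap-around effect that zero padding is designed to remove.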

Treatment of End Effects by Zero Padding 

The discrete convolution theorem presumes a set of two circumstances that 
are not universal. First, it assumes that the input signal is periodic, whereas real 
data often either go forever without repetition or else consist of one nonperiodic 
stretch of finite length. Second, the convolution theorem takes the duration of the 
response to be the same as the period of the data; they are both N. We need to 
work around these two constraints. 

The second is very straightforward. Almost always, one is interested in a 
response function whose duration M is much shorter than the length of the data 
set N. In this case, you simply extend the response function to length N by 
padding it with zeros, i.e., define $r_k = 0$ for $M/2 < k \le N/2$ and also for 
$-N/2 + 1 \le k < -M/2 + 1$. Dealing with the first constraint is more challenging. 
Since the convolution theorem rashly assumes that the data are periodic, it will 
falsely “pollute” the first output channel $(r * s)_0$ with some wrapped-around data 
from the far end of the data stream $s_{N-1}$, $s_{N-2}$, etc. (See Figure 13.1.3.) So, 
we need to set up a buffer zone of zero-padded values at the end of the $s_j$ vector, 
in order to make this pollution zero. How many zero values do we need in this 
buffer? Exactly as many as the most negative index for which the response function 
is nonzero. For example, if $r_{-3}$ is nonzero, while $r_{-4}, r_{-5}, \ldots$ are all zero, then we 
need three zero pads at the end of the data: $s_{N-3} = s_{N-2} = s_{N-1} = 0$. These 
zeros will protect the first output channel $(r * s)_0$ from wrap-around pollution. It 
should be obvious that the second output channel $(r * s)_1$ and subsequent ones will 
also be protected by these same zeros. Let K denote the number of padding zeros, 
so that the last actual input data point is $s_{N-K-1}$. 

What now about pollution of the very last output channel? Since the data 
now end with $s_{N-K-1}$, the last output channel of interest is $(r * s)_{N-K-1}$. This 
channel can be polluted by wrap-around from input channel $s_0$ unless the number 
K is also large enough to take care of the most positive index k for which the 
response function $r_k$ is nonzero. For example, if $r_0$ through $r_6$ are nonzero, while 
$r_7, r_8, \ldots$ are all zero, then we need at least K = 6 padding zeros at the end of the 
data: $s_{N-6} = \cdots = s_{N-1} = 0$. 

Figure 13.1.3. The wrap-around problem in convolving finite segments of a function. Not only must 
the response function wrap be viewed as cyclic, but so must the sampled original function. Therefore 
a portion at each end of the original function is erroneously wrapped around by convolution with the 
response function. 

Figure 13.1.4. Zero padding as solution to the wrap-around problem. The original function is extended 
by zeros, serving a dual purpose: When the zeros wrap around, they do not disturb the true convolution; 
and while the original function wraps around onto the zero region, that region can be discarded. 

To summarize: we need to pad the data with a number of zeros on one 
end equal to the maximum positive duration or maximum negative duration of the 
response function, whichever is larger. (For a symmetric response function of 
duration M, you will need only M/2 zero pads.) Combining this operation with the 
padding of the response $r_k$ described above, we effectively insulate the data from 
artifacts of undesired periodicity. Figure 13.1.4 illustrates matters. 
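
The padding rule just summarized can be sketched in a few lines. The helper names `pad_count` and `zero_pad` are hypothetical, the arrays are 0-based, and the caller is assumed to pass a most negative index that is at most zero and a most positive index that is at least zero.

```c
#include <stddef.h>
#include <string.h>

/* Number of trailing zeros needed: the larger of the most positive index
 * and the magnitude of the most negative index at which the response
 * function is nonzero. Assumes most_negative_k <= 0 <= most_positive_k. */
size_t pad_count(long most_negative_k, long most_positive_k)
{
    size_t neg = (size_t)(-most_negative_k);
    size_t pos = (size_t)most_positive_k;
    return neg > pos ? neg : pos;
}

/* Copy n_data samples into a buffer of length N and zero the remainder.
 * Returns 0 on success, -1 if the buffer leaves fewer than K pad slots. */
int zero_pad(const double *data, size_t n_data,
             double *buf, size_t N, size_t K)
{
    if (N < n_data + K) return -1;
    memcpy(buf, data, n_data * sizeof *data);
    for (size_t i = n_data; i < N; i++) buf[i] = 0.0;
    return 0;
}
```

For the examples in the text, a response that is nonzero from $r_{-3}$ through $r_6$ gives `pad_count(-3, 6) == 6`.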

Use of FFT for Convolution 

The data, complete with zero padding, are now a set of real numbers $s_j$, $j = 
0, \ldots, N - 1$, and the response function is zero padded out to duration N and 
arranged in wrap-around order. (Generally this means that a large contiguous section 
of the $r_k$’s, in the middle of that array, is zero, with nonzero values clustered at 
the two extreme ends of the array.) You now compute the discrete convolution as 
follows: Use the FFT algorithm to compute the discrete Fourier transform of s and of 
r. Multiply the two transforms together component by component, remembering that 
the transforms consist of complex numbers. Then use the FFT algorithm to take the 
inverse discrete Fourier transform of the products. The answer is the convolution r * s. 

What about deconvolution? Deconvolution is the process of undoing the 
smearing in a data set that has occurred under the influence of a known response 
function, for example, because of the known effect of a less-than-perfect measuring 
apparatus. The defining equation of deconvolution is the same as that for convolution, 
namely (13.1.1), except now the left-hand side is taken to be known, and (13.1.1) is 
to be considered as a set of N linear equations for the unknown quantities $s_j$. Solving 
these simultaneous linear equations in the time domain of (13.1.1) is unrealistic in 
most cases, but the FFT renders the problem almost trivial. Instead of multiplying 
the transform of the signal and response to get the transform of the convolution, we 
just divide the transform of the (known) convolution by the transform of the response 
to get the transform of the deconvolved signal. 

This procedure can go wrong mathematically if the transform of the response 
function is exactly zero for some value $R_n$, so that we can’t divide by it. This 
indicates that the original convolution has truly lost all information at that one 
frequency, so that a reconstruction of that frequency component is not possible. 
You should be aware, however, that apart from mathematical problems, the process 
of deconvolution has other practical shortcomings. The process is generally quite 
sensitive to noise in the input data, and to the accuracy to which the response function 
$r_k$ is known. Perfectly reasonable attempts at deconvolution can sometimes produce 
nonsense for these reasons. In such cases you may want to make use of the additional 
process of optimal filtering, which is discussed in §13.3. 
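
The frequency-domain division, including the check for a vanishing $R_n$, might look like the following sketch. It operates on transforms that have already been computed and stored as separate real and imaginary arrays; the function name `deconvolve_spectrum` and the `floor` threshold parameter are assumptions of this illustration.

```c
#include <stddef.h>

/* Divide the transform C of the smeared data by the transform R of the
 * response, component by component: S_n = C_n * conj(R_n) / |R_n|^2.
 * Returns 0 on success, or -1 if some |R_n|^2 is at or below `floor`,
 * in which case that frequency component cannot be reconstructed. */
int deconvolve_spectrum(const double *c_re, const double *c_im,
                        const double *r_re, const double *r_im,
                        double *s_re, double *s_im,
                        size_t N, double floor)
{
    for (size_t n = 0; n < N; n++) {
        double mag2 = r_re[n] * r_re[n] + r_im[n] * r_im[n];
        if (mag2 <= floor) return -1;
        s_re[n] = (c_re[n] * r_re[n] + c_im[n] * r_im[n]) / mag2;
        s_im[n] = (c_im[n] * r_re[n] - c_re[n] * r_im[n]) / mag2;
    }
    return 0;
}
```

Passing `floor = 0.0` rejects only exact zeros; a small positive value is one way to refuse divisions that would mostly amplify noise, although it is no substitute for the optimal filtering of §13.3.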

Here is our routine for convolution and deconvolution, using the FFT as 
implemented in four1 of §12.2. Since the data and response functions are real, 
not complex, both of their transforms can be taken simultaneously using twofft. 
Note, however, that two calls to realft should be substituted if data and respns 
have very different magnitudes, to minimize roundoff. The data are assumed to be 
stored in a float array data[1..n], with n an integer power of two. The response 
function is assumed to be stored in wrap-around order in a sub-array respns[1..m] 
of the array respns[1..n]. The value of m can be any odd integer less than or equal 
to n, since the first thing the program does is to recopy the response function into the 
appropriate wrap-around order in respns[1..n]. The answer is provided in ans. 

#include "nrutil.h"

void convlv(float data[], unsigned long n, float respns[], unsigned long m,
    int isign, float ans[])
/* Convolves or deconvolves a real data set data[1..n] (including any
   user-supplied zero padding) with a response function respns[1..n]. The
   response function must be stored in wrap-around order in the first m
   elements of respns, where m is an odd integer <= n. Wrap-around order means
   that the first half of the array respns contains the impulse response
   function at positive times, while the second half of the array contains the
   impulse response function at negative times, counting down from the highest
   element respns[m]. On input isign is +1 for convolution, -1 for
   deconvolution. The answer is returned in the first n components of ans.
   However, ans must be supplied in the calling program with dimensions
   [1..2*n], for consistency with twofft. n MUST be an integer power of two. */
{
    void realft(float data[], unsigned long n, int isign);
    void twofft(float data1[], float data2[], float fft1[], float fft2[],
        unsigned long n);
    unsigned long i,no2;
    float dum,mag2,*fft;

    fft=vector(1,n<<1);
    for (i=1;i<=(m-1)/2;i++)              /* Put respns in array of length n. */
        respns[n+1-i]=respns[m+1-i];
    for (i=(m+3)/2;i<=n-(m-1)/2;i++)      /* Pad with zeros. */
        respns[i]=0.0;
    twofft(data,respns,fft,ans,n);        /* FFT both at once. */
    no2=n>>1;
    for (i=2;i<=n+2;i+=2) {
        if (isign == 1) {
            /* Multiply FFTs to convolve. */
            ans[i-1]=(fft[i-1]*(dum=ans[i-1])-fft[i]*ans[i])/no2;
            ans[i]=(fft[i]*dum+fft[i-1]*ans[i])/no2;
        } else if (isign == -1) {
            if ((mag2=SQR(ans[i-1])+SQR(ans[i])) == 0.0)
                nrerror("Deconvolving at response zero in convlv");
            /* Divide FFTs to deconvolve. */
            ans[i-1]=(fft[i-1]*(dum=ans[i-1])+fft[i]*ans[i])/mag2/no2;
            ans[i]=(fft[i]*dum-fft[i-1]*ans[i])/mag2/no2;
        } else nrerror("No meaning for isign in convlv");
    }
    ans[2]=ans[n+1];                      /* Pack last element with first for realft. */
    realft(ans,n,-1);                     /* Inverse transform back to time domain. */
    free_vector(fft,1,n<<1);
}


Convolving or Deconvolving Very Large Data Sets 

If your data set is so long that you do not want to fit it into memory all at 
once, then you must break it up into sections and convolve each section separately. 
Now, however, the treatment of end effects is a bit different. You have to worry 
not only about spurious wrap-around effects, but also about the fact that the ends of 
each section of data should have been influenced by data at the nearby ends of the 
immediately preceding and following sections of data, but were not so influenced 
since only one section of data is in the machine at a time. 

Figure 13.1.5. The overlap-add method for convolving a response with a very long signal. The signal 
data is broken up into smaller pieces. Each is zero padded at both ends and convolved (denoted by 
bold arrows in the figure). Finally the pieces are added back together, including the overlapping regions 
formed by the zero pads. 

There are two, related, standard solutions to this problem. Both are fairly 
obvious, so with a few words of description here, you ought to be able to implement 
them for yourself. The first solution is called the overlap-save method. In this 
technique you pad only the very beginning of the data with enough zeros to avoid 
wrap-around pollution. After this initial padding, you forget about zero padding 
altogether. Bring in a section of data and convolve or deconvolve it. Then throw out 
the points at each end that are polluted by wrap-around end effects. Output only the 
remaining good points in the middle. Now bring in the next section of data, but not 
all new data. The first points in each new section overlap the last points from the 
preceding section of data. The sections must be overlapped sufficiently so that the 
polluted output points at the end of one section are recomputed as the first of the 
unpolluted output points from the subsequent section. With a bit of thought you can 
easily determine how many points to overlap and save. 
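
A minimal sketch of the overlap-save bookkeeping follows, under simplifying assumptions that are not in the text: the response is causal, r[0..M-1] with 1 <= M <= SECTION, so that only the first M - 1 outputs of each section are polluted; and a direct cyclic sum stands in for the FFT-based convolution of each section. The names `SECTION`, `cyclic_section`, and `overlap_save` are hypothetical.

```c
#include <stddef.h>

#define SECTION 8   /* section length used for each cyclic convolution */

/* Cyclic convolution of one section with a causal response r[0..M-1]. */
static void cyclic_section(const double *x, const double *r, size_t M,
                           double *y)
{
    for (size_t j = 0; j < SECTION; j++) {
        double sum = 0.0;
        for (size_t k = 0; k < M; k++)
            sum += x[(j + SECTION - k) % SECTION] * r[k];
        y[j] = sum;
    }
}

/* Overlap-save: out[0..n-1] receives the convolution of s with r.
 * Requires 1 <= M <= SECTION. */
void overlap_save(const double *s, size_t n, const double *r, size_t M,
                  double *out)
{
    size_t step = SECTION - (M - 1);     /* good outputs per section */
    double x[SECTION], y[SECTION];
    for (size_t start = 0; start < n; start += step) {
        /* The section covers inputs start-(M-1) .. start+step-1. Indices
           below 0 are the initial zero padding; the first M-1 points
           repeat the tail of the previous section. */
        for (size_t i = 0; i < SECTION; i++) {
            long idx = (long)start - (long)(M - 1) + (long)i;
            x[i] = (idx >= 0 && idx < (long)n) ? s[idx] : 0.0;
        }
        cyclic_section(x, r, M, y);
        /* Discard the M-1 polluted outputs and keep the rest. */
        for (size_t i = 0; i < step && start + i < n; i++)
            out[start + i] = y[M - 1 + i];
    }
}
```

With a two-sided response the same idea applies, except that polluted points must be discarded at both ends of each section, as described above.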

The second solution, called the overlap-add method, is illustrated in Figure 
13.1.5. Here you don’t overlap the input data. Each section of data is disjoint from 
the others and is used exactly once. However, you carefully zero-pad it at both ends 
so that there is no wrap-around ambiguity in the output convolution or deconvolution. 
Now you overlap and add these sections of output. Thus, an output point near the 
end of one section will have the response due to the input points at the beginning 
of the next section of data properly added in to it, and likewise for an output point 
near the beginning of a section, mutatis mutandis. 
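
The overlap-add bookkeeping can be sketched in the same spirit. Again the response is assumed causal, r[0..M-1] with M >= 1, and a direct sum stands in for the FFT convolution of each zero-padded section; `BLOCK` and `overlap_add` are hypothetical names.

```c
#include <stddef.h>

#define BLOCK 6   /* number of fresh input points per section */

/* Overlap-add with a causal response r[0..M-1]. out must have room for
 * n + M - 1 points and receives the full linear convolution of s and r. */
void overlap_add(const double *s, size_t n, const double *r, size_t M,
                 double *out)
{
    for (size_t j = 0; j < n + M - 1; j++) out[j] = 0.0;
    for (size_t start = 0; start < n; start += BLOCK) {
        size_t len = (n - start < BLOCK) ? n - start : BLOCK;
        /* Convolving a zero-padded section is equivalent to a linear
           convolution of the section alone: len + M - 1 outputs, the
           last M - 1 of which spill into the next section's range and
           are added to that section's contributions. */
        for (size_t i = 0; i < len; i++)
            for (size_t k = 0; k < M; k++)
                out[start + i + k] += s[start + i] * r[k];
    }
}
```

Each input point is used exactly once, and the only coupling between sections is the addition in the overlapping M - 1 output points.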

Even when computer memory is available, there is some slight gain in computing 
speed in segmenting a long data set, since the FFT’s $N \log_2 N$ is slightly slower than 
linear in N. However, the log term is so slowly varying that you will often be much 
happier to avoid the bookkeeping complexities of the overlap-add or overlap-save 
methods: If it is practical to do so, just cram the whole data set into memory and 
FFT away. Then you will have more time for the finer things in life, some of which 
are described in succeeding sections of this chapter. 



CITED REFERENCES AND FURTHER READING: 

Nussbaumer, H.J. 1982, Fast Fourier Transform and Convolution Algorithms (New York: 
Springer-Verlag). 

Brigham, E.O. 1974, The Fast Fourier Transform (Englewood Cliffs, NJ: Prentice-Hall), 
Chapter 13. 


13.2 Correlation and Autocorrelation Using the FFT 


Correlation is the close mathematical cousin of convolution. It is in some 
ways simpler, however, because the two functions that go into a correlation are not 
as conceptually distinct as were the data and response functions that entered into 
convolution. Rather, in correlation, the functions are represented by different, but 
generally similar, data sets. We investigate their “correlation,” by comparing them 
both directly superposed, and with one of them shifted left or right. 

We have already defined in equation (12.0.10) the correlation between two 
continuous functions g(t) and h(t), which is denoted Corr(g, h), and is a function 
of lag t. We will occasionally show this time dependence explicitly, with the rather 
awkward notation Corr(g, h)(t). The correlation will be large at some value of t if the 
first function (g) is a close copy of the second (h) but lags it in time by t, i.e., if the first 
function is shifted to the right of the second. Likewise, the correlation will be large 
for some negative value of t if the first function leads the second, i.e., is shifted to the 
left of the second. The relation that holds when the two functions are interchanged is 

$$\mathrm{Corr}(g, h)(t) = \mathrm{Corr}(h, g)(-t) \tag{13.2.1}$$

The discrete correlation of two sampled functions $g_k$ and $h_k$, each periodic 
with period N, is defined by 

$$\mathrm{Corr}(g, h)_j \equiv \sum_{k=0}^{N-1} g_{j+k}\, h_k \tag{13.2.2}$$


The discrete correlation theorem says that this discrete correlation of two real 
functions g and h is one member of the discrete Fourier transform pair 

$$\mathrm{Corr}(g, h)_j \iff G_k H_k^{*} \tag{13.2.3}$$

where $G_k$ and $H_k$ are the discrete Fourier transforms of $g_j$ and $h_j$, and the asterisk 
denotes complex conjugation. This theorem makes the same presumptions about the 
functions as those encountered for the discrete convolution theorem. 

We can compute correlations using the FFT as follows: FFT the two data sets, 
multiply one resulting transform by the complex conjugate of the other, and inverse 
transform the product. The result (call it $r_k$) will formally be a complex vector 
of length N. However, it will turn out to have all its imaginary parts zero since 
the original data sets were both real. The components of $r_k$ are the values of the 
correlation at different lags, with positive and negative lags stored in the by now 
familiar wrap-around order: The correlation at zero lag is in $r_0$, the first component; 
the correlation at lag 1 is in $r_1$, the second component; the correlation at lag $-1$ 
is in $r_{N-1}$, the last component; etc. 
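
The definition (13.2.2) and the wrap-around storage of lags can be checked with a direct sum. This sketch uses 0-based arrays and the hypothetical names `cyclic_correlate` and `lag_slot`; it evaluates the correlation in the time domain rather than by FFT.

```c
#include <stddef.h>

/* Corr(g,h)_j = sum_{k=0}^{N-1} g_{(j+k) mod N} h_k, for j = 0..N-1,
 * treating both sequences as periodic with period N. */
void cyclic_correlate(const double *g, const double *h, size_t N,
                      double *corr)
{
    for (size_t j = 0; j < N; j++) {
        double sum = 0.0;
        for (size_t k = 0; k < N; k++)
            sum += g[(j + k) % N] * h[k];
        corr[j] = sum;
    }
}

/* Slot that holds a given lag in wrap-around order:
 * lag >= 0 is stored at [lag], lag < 0 at [N + lag]. */
size_t lag_slot(long lag, size_t N)
{
    return (lag >= 0) ? (size_t)lag : (size_t)((long)N + lag);
}
```

If g is a copy of h shifted two samples to the right, the peak appears at lag +2; interchanging the arguments moves it to lag $-2$, in agreement with (13.2.1).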

Just as in the case of convolution we have to consider end effects, since our 
data will not, in general, be periodic as intended by the correlation theorem. Here 
again, we can use zero padding. If you are interested in the correlation for lags as 
large as ±K, then you must append a buffer zone of K zeros at the end of both 
input data sets. If you want all possible lags from N data points (not a usual thing), 
then you will need to pad the data with an equal number of zeros; this is the extreme 
case. So here is the program: 


#include "nrutil.h"

void correl(float data1[], float data2[], unsigned long n, float ans[])
/* Computes the correlation of two real data sets data1[1..n] and data2[1..n]
   (including any user-supplied zero padding). n MUST be an integer power of
   two. The answer is returned as the first n points in ans[1..2*n] stored in
   wrap-around order, i.e., correlations at increasingly negative lags are in
   ans[n] on down to ans[n/2+1], while correlations at increasingly positive
   lags are in ans[1] (zero lag) on up to ans[n/2]. Note that ans must be
   supplied in the calling program with length at least 2*n, since it is also
   used as working space. Sign convention of this routine: if data1 lags data2,
   i.e., is shifted to the right of it, then ans will show a peak at positive
   lags. */
{
    void realft(float data[], unsigned long n, int isign);
    void twofft(float data1[], float data2[], float fft1[], float fft2[],
        unsigned long n);
    unsigned long no2,i;
    float dum,*fft;

    fft=vector(1,n<<1);
    twofft(data1,data2,fft,ans,n);        /* Transform both data vectors at once. */
    no2=n>>1;                             /* Normalization for inverse FFT. */
    for (i=2;i<=n+2;i+=2) {
        /* Multiply to find FFT of their correlation. */
        ans[i-1]=(fft[i-1]*(dum=ans[i-1])+fft[i]*ans[i])/no2;
        ans[i]=(fft[i]*dum-fft[i-1]*ans[i])/no2;
    }
    ans[2]=ans[n+1];                      /* Pack first and last into one element. */
    realft(ans,n,-1);                     /* Inverse transform gives correlation. */
    free_vector(fft,1,n<<1);
}


As in convlv, it would be better to substitute two calls to realft for the one 
call to twofft, if data1 and data2 have very different magnitudes, to minimize 
roundoff error. 

The discrete autocorrelation of a sampled function $g_j$ is just the discrete 
correlation of the function with itself. Obviously this is always symmetric with 
respect to positive and negative lags. Feel free to use the above routine correl 
to obtain autocorrelations, simply calling it with the same data vector in both 
arguments. If the inefficiency bothers you, routine realft can, of course, be used 
to transform the data vector instead. 



CITED REFERENCES AND FURTHER READING: 

Brigham, E.O. 1974, The Fast Fourier Transform (Englewood Cliffs, NJ: Prentice-Hall), §13-2. 



13.3 Optimal (Wiener) Filtering with the FFT 


There are a number of other tasks in numerical processing that are routinely 
handled with Fourier techniques. One of these is filtering for the removal of noise 
from a “corrupted” signal. The particular situation we consider is this: There is some 
underlying, uncorrupted signal u(t) that we want to measure. The measurement 
process is imperfect, however, and what comes out of our measurement device is a 
corrupted signal c(t). The signal c(t) may be less than perfect in either or both of 
two respects. First, the apparatus may not have a perfect “delta-function” response, 
so that the true signal u(t) is convolved with (smeared out by) some known response 
function r(t) to give a smeared signal s(t), 

$$s(t) = \int_{-\infty}^{\infty} r(t - \tau)\, u(\tau)\, d\tau \quad \text{or} \quad S(f) = R(f)\, U(f) \tag{13.3.1}$$

where S, R, U are the Fourier transforms of s, r, u, respectively. Second, the 
measured signal c(t) may contain an additional component of noise n(t), 

c(t) = s(t) + n(t) (13.3.2) 


We already know how to deconvolve the effects of the response function r in 
the absence of any noise (§13.1); we just divide C(f) by R(f) to get a deconvolved 
signal. We now want to treat the analogous problem when noise is present. Our 
task is to find the optimal filter, $\phi(t)$ or $\Phi(f)$, which, when applied to the measured 
signal c(t) or C(f), and then deconvolved by r(t) or R(f), produces a signal $\widetilde{u}(t)$ 
or $\widetilde{U}(f)$ that is as close as possible to the uncorrupted signal u(t) or U(f). In other 
words we will estimate the true signal U by 

$$\widetilde{U}(f) = \frac{C(f)\, \Phi(f)}{R(f)} \tag{13.3.3}$$


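In what sense is $\widetilde{U}$ to be close to U? We ask that they be close in the 
least-square sense 

$$\int_{-\infty}^{\infty} \left| \widetilde{u}(t) - u(t) \right|^2 dt = \int_{-\infty}^{\infty} \left| \widetilde{U}(f) - U(f) \right|^2 df \quad \text{is minimized.} \tag{13.3.4}$$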


Substituting equations (13.3.3) and (13.3.2), the right-hand side of (13.3.4) becomes 


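$$\int_{-\infty}^{\infty} \left| \frac{[S(f) + N(f)]\, \Phi(f)}{R(f)} - \frac{S(f)}{R(f)} \right|^2 df = \int_{-\infty}^{\infty} |R(f)|^{-2} \left\{ |S(f)|^2\, |1 - \Phi(f)|^2 + |N(f)|^2\, |\Phi(f)|^2 \right\} df \tag{13.3.5}$$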


The signal S and the noise N are uncorrelated, so their cross product, when 
integrated over frequency f, gives zero. (This is practically the definition of what we 
mean by noise!) Obviously (13.3.5) will be a minimum if and only if the integrand 
is minimized with respect to $\Phi(f)$ at every value of f. Let us search for such a 
solution where $\Phi(f)$ is a real function. Differentiating with respect to $\Phi$, and setting 
the result equal to zero gives 

$$\Phi(f) = \frac{|S(f)|^2}{|S(f)|^2 + |N(f)|^2} \tag{13.3.6}$$

This is the formula for the optimal filter $\Phi(f)$. 

Notice that equation (13.3.6) involves S, the smeared signal, and N, the noise. 
The two of these add up to be C, the measured signal. Equation (13.3.6) does not 
contain U, the “true” signal. This makes for an important simplification: The optimal 
filter can be determined independently of the determination of the deconvolution 
function that relates S and U. 

To determine the optimal filter from equation (13.3.6) we need some way 
of separately estimating $|S|^2$ and $|N|^2$. There is no way to do this from the 
measured signal C alone without some other information, or some assumption or 
guess. Luckily, the extra information is often easy to obtain. For example, we 
can sample a long stretch of data c(t) and plot its power spectral density using 
equations (12.0.14), (12.1.8), and (12.1.5). This quantity is proportional to the sum 
$|S|^2 + |N|^2$, so we have 

$$|S(f)|^2 + |N(f)|^2 \approx P_c(f) = |C(f)|^2 \qquad 0 \le f < f_c \tag{13.3.7}$$


(More sophisticated methods of estimating the power spectral density will be 
discussed in §13.4 and §13.7, but the estimation above is almost always good enough 
for the optimal filter problem.) The resulting plot (see Figure 13.3.1) will often 
immediately show the spectral signature of a signal sticking up above a continuous 
noise spectrum. The noise spectrum may be flat, or tilted, or smoothly varying; it 
doesn’t matter, as long as we can guess a reasonable hypothesis as to what it is. 
Draw a smooth curve through the noise spectrum, extrapolating it into the region 
dominated by the signal as well. Now draw a smooth curve through the signal plus 
noise power. The difference between these two curves is your smooth “model” of the 
signal power. The quotient of your model of signal power to your model of signal 
plus noise power is the optimal filter $\Phi(f)$. [Extend it to negative values of f by the 
formula $\Phi(-f) = \Phi(f)$.] Notice that $\Phi(f)$ will be close to unity where the noise 
is negligible, and close to zero where the noise is dominant. That is how it does its 
job! The intermediate dependence given by equation (13.3.6) just turns out to be the 
optimal way of going in between these two extremes. 
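
Given sampled versions of the two smooth curves, the quotient can be formed as in the sketch below. The function name `wiener_filter`, the clipping of a negative signal model to zero, and the choice to return zero where the total power vanishes are assumptions of this illustration; the text does not prescribe them.

```c
#include <stddef.h>

/* Phi_k = (signal model) / (signal-plus-noise model), where the signal
 * model is total_power - noise_power. A negative difference is clipped
 * to zero and an empty bin yields Phi = 0 (choices made for this sketch). */
void wiener_filter(const double *total_power, const double *noise_power,
                   double *phi, size_t n)
{
    for (size_t k = 0; k < n; k++) {
        double signal = total_power[k] - noise_power[k];
        if (signal < 0.0) signal = 0.0;
        phi[k] = (total_power[k] > 0.0) ? signal / total_power[k] : 0.0;
    }
}
```

Where the noise model is a small fraction of the total power the filter is near unity, and where the noise model accounts for all of the power it is zero.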

Figure 13.3.1. Optimal (Wiener) filtering. The power spectrum of signal plus noise shows a signal peak 
added to a noise tail. The tail is extrapolated back into the signal region as a “noise model.” Subtracting 
gives the “signal model.” The models need not be accurate for the method to be useful. A simple algebraic 
combination of the models gives the optimal filter (see text). 

Because the optimal filter results from a minimization problem, the quality of 
the results obtained by optimal filtering differs from the true optimum by an amount 
that is second order in the precision to which the optimal filter is determined. In other 
words, even a fairly crudely determined optimal filter (sloppy, say, at the 10 percent 
level) can give excellent results when it is applied to data. That is why the separation 
of the measured signal C into signal and noise components S and N can usefully be 
done “by eye” from a crude plot of power spectral density. All of this may give you 
thoughts about iterating the procedure we have just described. For example, after 
designing a filter with response $\Phi(f)$ and using it to make a respectable guess at the 
signal $\widetilde{U}(f) = \Phi(f) C(f)/R(f)$, you might turn about and regard $\widetilde{U}(f)$ as a fresh 
new signal which you could improve even further with the same filtering technique. 
Don’t waste your time on this line of thought. The scheme converges to a signal of 
S(f) = 0. Converging iterative methods do exist; this just isn’t one of them. 

You can use the routine four1 (§12.2) or realft (§12.3) to FFT your data 
when you are constructing an optimal filter. To apply the filter to your data, you 
can use the methods described in §13.1. The specific routine convlv is not needed 
for optimal filtering, since your filter is constructed in the frequency domain to 
begin with. If you are also deconvolving your data with a known response function, 
however, you can modify convlv to multiply by your optimal filter just before it 
takes the inverse Fourier transform. 


CITED REFERENCES AND FURTHER READING: 

Rabiner, L.R., and Gold, B. 1975, Theory and Application of Digital Signal Processing (Englewood 
Cliffs, NJ: Prentice-Hall). 

Nussbaumer, H.J. 1982, Fast Fourier Transform and Convolution Algorithms (New York: 
Springer-Verlag). 

Elliott, D.F., and Rao, K.R. 1982, Fast Transforms: Algorithms, Analyses, Applications (New York: 
Academic Press). 



13.4 Power Spectrum Estimation Using the FFT 


In the previous section we “informally” estimated the power spectral density of a 
function c(t) by taking the modulus-squared of the discrete Fourier transform of some 
finite, sampled stretch of it. In this section we’ll do roughly the same thing, but with 
considerably greater attention to details. Our attention will uncover some surprises. 

The first detail is power spectrum (also called a power spectral density or 
PSD) normalization. In general there is some relation of proportionality between a 
measure of the squared amplitude of the function and a measure of the amplitude 
of the PSD. Unfortunately there are several different conventions for describing 
the normalization in each domain, and many opportunities for getting wrong the 
relationship between the two domains. Suppose that our function c(t) is sampled at 
N points to produce values $c_0 \ldots c_{N-1}$, and that these points span a range of time 
T, that is $T = (N - 1)\Delta$, where $\Delta$ is the sampling interval. Then here are several 
different descriptions of the total power: 


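$$\sum_{j=0}^{N-1} |c_j|^2 \equiv \text{“sum squared amplitude”} \tag{13.4.1}$$

$$\frac{1}{T} \int_0^T |c(t)|^2\, dt \approx \frac{1}{N} \sum_{j=0}^{N-1} |c_j|^2 \equiv \text{“mean squared amplitude”} \tag{13.4.2}$$

$$\int_0^T |c(t)|^2\, dt \approx \Delta \sum_{j=0}^{N-1} |c_j|^2 \equiv \text{“time-integral squared amplitude”} \tag{13.4.3}$$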


PSD estimators, as we shall see, have an even greater variety. In this section, 
we consider a class of them that give estimates at discrete values of frequency $f_i$, 
where i will range over integer values. In the next section, we will learn about 
a different class of estimators that produce estimates that are continuous functions 
of frequency f. Even if it is agreed always to relate the PSD normalization to a 
particular description of the function normalization (e.g., 13.4.2), there are at least 
the following possibilities: The PSD is 

• defined for discrete positive, zero, and negative frequencies, and its sum 
over these is the function mean squared amplitude 

• defined for zero and discrete positive frequencies only, and its sum over 
these is the function mean squared amplitude 

• defined in the Nyquist interval from $-f_c$ to $f_c$, and its integral over this 
range is the function mean squared amplitude 

• defined from 0 to $f_c$, and its integral over this range is the function mean 
squared amplitude 

It never makes sense to integrate the PSD of a sampled function outside of the 
Nyquist interval $-f_c$ to $f_c$ since, according to the sampling theorem, power there 
will have been aliased into the Nyquist interval. 

It is hopeless to define enough notation to distinguish all possible combinations 
of normalizations. In what follows, we use the notation P(f) to mean any of the 
above PSDs, stating in each instance how the particular P(f) is normalized. Beware 
the inconsistent notation in the literature. 

The method of power spectrum estimation used in the previous section is a 
simple version of an estimator called, historically, the periodogram. If we take an 
N-point sample of the function c(t) at equal intervals and use the FFT to compute 
its discrete Fourier transform 

$$C_k = \sum_{j=0}^{N-1} c_j\, e^{2\pi i j k / N} \qquad k = 0, \ldots, N - 1 \tag{13.4.4}$$

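then the periodogram estimate of the power spectrum is defined at N/2 + 1 
frequencies as 

$$\begin{aligned}
P(0) = P(f_0) &= \frac{1}{N^2}\, |C_0|^2 \\
P(f_k) &= \frac{1}{N^2} \left[ |C_k|^2 + |C_{N-k}|^2 \right] \qquad k = 1, 2, \ldots, \left( \frac{N}{2} - 1 \right) \\
P(f_c) = P(f_{N/2}) &= \frac{1}{N^2}\, |C_{N/2}|^2
\end{aligned} \tag{13.4.5}$$

where $f_k$ is defined only for the zero and positive frequencies 

$$f_k \equiv \frac{k}{N\Delta} = 2 f_c \frac{k}{N} \qquad k = 0, 1, \ldots, \frac{N}{2} \tag{13.4.6}$$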

By Parseval’s theorem, equation (12.1.10), we see immediately that equation (13.4.5) 
is normalized so that the sum of the N/2 + 1 values of P is equal to the mean 
squared amplitude of the function $c_j$. 

We must now ask this question. In what sense is the periodogram estimate 
(13.4.5) a “true” estimator of the power spectrum of the underlying function c(t)? 
You can find the answer treated in considerable detail in the literature cited (see, 
e.g., [1] for an introduction). Here is a summary. 

First, is the expectation value of the periodogram estimate equal to the power 
spectrum, i.e., is the estimator correct on average? Well, yes and no. We wouldn’t 
really expect one of the $P(f_k)$’s to equal the continuous P(f) at exactly $f_k$, since $f_k$ 
is supposed to be representative of a whole frequency “bin” extending from halfway 
from the preceding discrete frequency to halfway to the next one. We should be 
expecting the $P(f_k)$ to be some kind of average of P(f) over a narrow window 
function centered on its $f_k$. For the periodogram estimate (13.4.5) that window 
function, as a function of s, the frequency offset in bins, is 

$$W(s) = \frac{1}{N^2} \left[ \frac{\sin(\pi s)}{\sin(\pi s / N)} \right]^2 \tag{13.4.7}$$


Notice that W(s) has oscillatory lobes but, apart from these, falls off only about as 
$W(s) \approx (\pi s)^{-2}$. This is not a very rapid fall-off, and it results in significant leakage 
(that is the technical term) from one frequency to another in the periodogram estimate. 
Notice also that W(s) happens to be zero for s equal to a nonzero integer. This means 
that if the function c(t) is a pure sine wave of frequency exactly equal to one of the 
$f_k$’s, then there will be no leakage to adjacent $f_k$’s. But this is not the characteristic 
case! If the frequency is, say, one-third of the way between two adjacent $f_k$’s, then 
the leakage will extend well beyond those two adjacent bins. The solution to the 
problem of leakage is called data windowing, and we will discuss it below. 





Turn now to another question about the periodogram estimate. What is the 
variance of that estimate as N goes to infinity? In other words, as we take more 
sampled points from the original function (either sampling a longer stretch of data at 
the same sampling rate, or else by resampling the same stretch of data with a faster 
sampling rate), then how much more accurate do the estimates $P_k$ become? The 
unpleasant answer is that the periodogram estimates do not become more accurate 
at all! In fact, the variance of the periodogram estimate at a frequency $f_k$ is always 
equal to the square of its expectation value at that frequency. In other words, the 
standard deviation is always 100 percent of the value, independent of N! How can 
this be? Where did all the information go as we added points? It all went into 
producing estimates at a greater number of discrete frequencies $f_k$. If we sample a 
longer run of data using the same sampling rate, then the Nyquist critical frequency 
$f_c$ is unchanged, but we now have finer frequency resolution (more $f_k$’s) within the 
Nyquist frequency interval; alternatively, if we sample the same length of data with a 
finer sampling interval, then our frequency resolution is unchanged, but the Nyquist 
range now extends up to a higher frequency. In neither case do the additional samples 
reduce the variance of any one particular frequency’s estimated PSD. 

You don’t have to live with PSD estimates with 100 percent standard deviations, 
however. You simply have to know some techniques for reducing the variance of 
the estimates. Here are two techniques that are very nearly identical mathematically, 
though different in implementation. The first is to compute a periodogram estimate 
with finer discrete frequency spacing than you really need, and then to sum the 
periodogram estimates at K consecutive discrete frequencies to get one “smoother” 
estimate at the mid frequency of those K. The variance of that summed estimate 
will be smaller than the estimate itself by a factor of exactly 1/K, i.e., the standard
deviation will be smaller than 100 percent by a factor $1/\sqrt{K}$. Thus, to estimate the
power spectrum at M + 1 discrete frequencies between 0 and $f_c$ inclusive, you begin
by taking the FFT of 2MK points (which number had better be an integer power of
two!). You then take the modulus square of the resulting coefficients, add positive
and negative frequency pairs, and divide by $(2MK)^2$, all according to equation
(13.4.5) with N = 2MK. Finally, you “bin” the results into summed (not averaged)
groups of K. This procedure is very easy to program, so we will not bother to give 
a routine for it. The reason that you sum, rather than average, K consecutive points 
is so that your final PSD estimate will preserve the normalization property that the 
sum of its M + 1 values equals the mean square value of the function. 

A second technique for estimating the PSD at M + 1 discrete frequencies in
the range 0 to $f_c$ is to partition the original sampled data into K segments each of
2M consecutive sampled points. Each segment is separately FFT’d to produce a
periodogram estimate (equation 13.4.5 with N = 2M). Finally, the K periodogram
estimates are averaged at each frequency. It is this final averaging that reduces the
variance of the estimate by a factor K (standard deviation by $\sqrt{K}$). This second
technique is computationally more efficient than the first technique above by a modest
factor, since it is logarithmically more efficient to take many shorter FFTs than one
longer one. The principal advantage of the second technique, however, is that only
2M data points are manipulated at a single time, not 2KM as in the first technique.
This means that the second technique is the natural choice for processing long runs
of data, as from a magnetic tape or other data record. We will give a routine later
for implementing this second technique, but we need first to return to the matters of
leakage and data windowing which were brought up after equation (13.4.7) above.


Data Windowing 

The purpose of data windowing is to modify equation (13.4.7), which expresses
the relation between the spectral estimate $P_k$ at a discrete frequency and the actual
underlying continuous spectrum P(f) at nearby frequencies. In general, the spectral
power in one “bin” k contains leakage from frequency components that are actually 
s bins away, where s is the independent variable in equation (13.4.7). There is, as 
we pointed out, quite substantial leakage even from moderately large values of s. 

When we select a run of N sampled points for periodogram spectral estimation,
we are in effect multiplying an infinite run of sampled data $c_j$ by a window function
in time, one that is zero except during the total sampling time NΔ, and is unity during
that time. In other words, the data are windowed by a square window function. By
the convolution theorem (12.0.9; but interchanging the roles of f and t), the Fourier
transform of the product of the data with this square window function is equal to the 
convolution of the data’s Fourier transform with the window’s Fourier transform. In 
fact, we determined equation (13.4.7) as nothing more than the square of the discrete 
Fourier transform of the unity window function. 


$$W(s) = \frac{1}{N^2}\left[\frac{\sin(\pi s)}{\sin(\pi s/N)}\right]^2 = \frac{1}{N^2}\left|\sum_{k=0}^{N-1} e^{2\pi i s k/N}\right|^2 \tag{13.4.8}$$


The reason for the leakage at large values of s is that the square window function
turns on and off so rapidly. Its Fourier transform has substantial components
at high frequencies. To remedy this situation, we can multiply the input data
$c_j$, j = 0, ..., N − 1 by a window function $w_j$ that changes more gradually from
zero to a maximum and then back to zero as j ranges from 0 to N. In this case, the
equations for the periodogram estimator (13.4.4–13.4.5) become


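$$D_k \equiv \sum_{j=0}^{N-1} c_j w_j\, e^{2\pi i jk/N} \qquad k = 0, \ldots, N-1 \tag{13.4.9}$$

$$\begin{aligned}
P(0) = P(f_0) &= \frac{1}{W_{ss}}\, |D_0|^2 \\
P(f_k) &= \frac{1}{W_{ss}} \left[\, |D_k|^2 + |D_{N-k}|^2 \,\right] \qquad k = 1, 2, \ldots, \left(\frac{N}{2} - 1\right) \\
P(f_c) = P(f_{N/2}) &= \frac{1}{W_{ss}}\, |D_{N/2}|^2
\end{aligned} \tag{13.4.10}$$

where $W_{ss}$ stands for “window squared and summed,”

$$W_{ss} \equiv N \sum_{j=0}^{N-1} w_j^2 \tag{13.4.11}$$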
and $f_k$ is given by (13.4.6). The more general form of (13.4.7) can now be written
in terms of the window function $w_j$ as

$$W(s) = \frac{1}{W_{ss}} \left| \sum_{k=0}^{N-1} e^{2\pi i s k/N}\, w_k \right|^2 \approx \frac{1}{W_{ss}} \left| \int_{-N/2}^{N/2} \cos(2\pi s k/N)\, w(k - N/2)\, dk \right|^2 \tag{13.4.12}$$


Here the approximate equality is useful for practical estimates, and holds for any
window that is left-right symmetric (the usual case), and for s ≪ N (the case of
interest for estimating leakage into nearby bins). The continuous function w(k − N/2)
in the integral is meant to be some smooth function that passes through the points $w_k$.

There is a lot of perhaps unnecessary lore about choice of a window function, and
practically every function that rises from zero to a peak and then falls again has been
named after someone. A few of the more common (also shown in Figure 13.4.1) are:

$$w_j = 1 - \left| \frac{j - \frac{1}{2}N}{\frac{1}{2}N} \right| \equiv \text{“Bartlett window”} \tag{13.4.13}$$

(The “Parzen window” is very similar to this.) 


$$w_j = \frac{1}{2}\left[ 1 - \cos\left( \frac{2\pi j}{N} \right) \right] \equiv \text{“Hann window”} \tag{13.4.14}$$


(The “Hamming window” is similar but does not go exactly to zero at the ends.) 


$$w_j = 1 - \left( \frac{j - \frac{1}{2}N}{\frac{1}{2}N} \right)^2 \equiv \text{“Welch window”} \tag{13.4.15}$$


We are inclined to follow Welch in recommending that you use either (13.4.13) 
or (13.4.15) in practical work. However, at the level of this book, there is 
effectively no difference between any of these (or similar) window functions. Their 
difference lies in subtle trade-offs among the various figures of merit that can be 
used to describe the narrowness or peakedness of the spectral leakage functions 
computed by (13.4.12). These figures of merit have such names as: highest sidelobe 
level (dB), sidelobe fall-off (dB per octave), equivalent noise bandwidth (bins), 3-dB 
bandwidth (bins), scallop loss (dB), worst case process loss (dB). Roughly speaking, 
the principal trade-off is between making the central peak as narrow as possible 
versus making the tails of the distribution fall off as rapidly as possible. For 
details, see (e.g.) [2]. Figure 13.4.2 plots the leakage amplitudes for several windows
already discussed.

There is particularly a lore about window functions that rise smoothly from 
zero to unity in the first small fraction (say 10 percent) of the data, then stay at 
unity until the last small fraction (again say 10 percent) of the data, during which 
the window function falls smoothly back to zero. These windows will squeeze a 
little bit of extra narrowness out of the main lobe of the leakage function (never as 
much as a factor of two, however), but trade this off by widening the leakage tail
by a significant factor (e.g., the reciprocal of 10 percent, a factor of ten). If we 
distinguish between the width of a window (number of samples for which it is at 
its maximum value) and its rise/fall time (number of samples during which it rises 
and falls); and if we distinguish between the FWHM (full width to half maximum 
value) of the leakage function’s main lobe and the leakage width (full width that 
contains half of the spectral power that is not contained in the main lobe); then these 
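quantities are related roughly by

$$(\text{FWHM in bins}) \approx \frac{N}{(\text{window width})} \qquad (\text{leakage width in bins}) \approx \frac{N}{(\text{window rise/fall time})} \tag{13.4.16}$$
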
For the windows given above in (13.4.13)–(13.4.15), the effective window
widths and the effective window rise/fall times are both of order $\frac{1}{2}N$. Generally
speaking, we feel that the advantages of windows whose rise and fall times are 
only small fractions of the data length are minor or nonexistent, and we avoid using 
them. One sometimes hears it said that flat-topped windows “throw away less of 
the data,” but we will now show you a better way of dealing with that problem by 
use of overlapping data segments. 

Figure 13.4.2. Leakage functions for the window functions of Figure 13.4.1. A signal whose frequency
is actually located at zero offset “leaks” into neighboring bins with the amplitude shown. The purpose
of windowing is to reduce the leakage at large offsets, where square (no) windowing has large sidelobes.
Offset can have a fractional value, since the actual signal frequency can be located between two frequency
bins of the FFT. (Horizontal axis: offset in units of frequency bins.)

Let us now suppose that we have chosen a window function, and that we are
ready to segment the data into K segments of N = 2M points. Each segment will
be FFT’d, and the resulting K periodograms will be averaged together to obtain a
PSD estimate at M + 1 frequency values from 0 to $f_c$. We must now distinguish
between two possible situations. We might want to obtain the smallest variance 
from a fixed amount of computation, without regard to the number of data points 
used. This will generally be the goal when the data are being gathered in real time, 
with the data-reduction being computer-limited. Alternatively, we might want to 
obtain the smallest variance from a fixed number of available sampled data points. 
This will generally be the goal in cases where the data are already recorded and 
we are analyzing it after the fact. 

In the first situation (smallest spectral variance per computer operation), it is 
best to segment the data without any overlapping. The first 2M data points constitute
segment number 1; the next 2M data points constitute segment number 2; and so on,
up to segment number K, for a total of 2KM sampled points. The variance in this
case, relative to a single segment, is reduced by a factor K. 

In the second situation (smallest spectral variance per data point), it turns out 
to be optimal, or very nearly optimal, to overlap the segments by one half of their 
length. The first and second sets of M points are segment number 1; the second 
and third sets of M points are segment number 2; and so on, up to segment number 
K, which is made of the Kth and (K + 1)st sets of M points. The total number of
sampled points is therefore (K + 1)M, just over half as many as with nonoverlapping
segments. The reduction in the variance is not a full factor of K, since the segments
are not statistically independent. It can be shown that the variance is instead reduced
by a factor of about 9K/11 (see the paper by Welch in [3]). This is, however,
significantly better than the reduction of about K/2 that would have resulted if the
same number of data points were segmented without overlapping.
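
The bookkeeping for the two segmentation schemes can be summarized in a few lines. This is only a sketch: the function names are ours, indices are zero-based, and the factor 9K/11 is the approximate figure quoted above, not an exact result for every window.

```c
/* Total number of samples consumed by K segments of 2M points each. */
long samples_needed(int m, int k, int overlap)
{
    return overlap ? (long)(k + 1) * m : 2L * k * m;
}

/* Zero-based index of the first sample of segment s (s = 0, ..., K-1).
   Overlapping segments advance by M points, others by 2M points. */
long segment_start(int m, int s, int overlap)
{
    return overlap ? (long)s * m : 2L * s * m;
}

/* Approximate factor by which averaging K segments reduces the variance
   relative to a single segment. */
double variance_reduction(int k, int overlap)
{
    return overlap ? 9.0 * k / 11.0 : (double)k;
}
```

With M = 128 and 1408 recorded points, for example, half-overlapping gives K = 10 segments and a variance reduction of roughly 8.2, whereas the same record holds only 5 complete nonoverlapping segments, for a reduction of 5.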

We can now codify these ideas into a routine for spectral estimation. While 
we generally avoid input/output coding, we make an exception here to show how 
data are read sequentially in one pass through a data file (referenced through the 
parameter FILE *fp). Only a small fraction of the data is in memory at any one time.
Note that spctrm returns the power at M, not M + 1, frequencies, omitting the
component $P(f_c)$ at the Nyquist frequency. It would also be straightforward to
include that component. 

#include <math.h>
#include <stdio.h>
#include "nrutil.h"

#define WINDOW(j,a,b) (1.0-fabs((((j)-1)-(a))*(b)))       /* Bartlett */
/* #define WINDOW(j,a,b) 1.0 */                           /* Square */
/* #define WINDOW(j,a,b) (1.0-SQR((((j)-1)-(a))*(b))) */  /* Welch */

void spctrm(FILE *fp, float p[], int m, int k, int ovrlap)
/* Reads data from input stream specified by file pointer fp and returns as p[j] the data's power
(mean square amplitude) at frequency (j-1)/(2*m) cycles per gridpoint, for j=1,2,...,m,
based on (2*k+1)*m data points (if ovrlap is set true (1)) or 4*k*m data points (if ovrlap
is set false (0)). The number of segments of the data is 2*k in both cases: The routine calls
four1 k times, each call with 2 partitions each of 2*m real data points. */
{
    void four1(float data[], unsigned long nn, int isign);
    int mm,m44,m43,m4,kk,joffn,joff,j2,j;
    float w,facp,facm,*w1,*w2,sumw=0.0,den=0.0;

    mm=m+m;                                 /* Useful factors. */
    m43=(m4=mm+mm)+3;
    m44=m43+1;
    w1=vector(1,m4);
    w2=vector(1,m);
    facm=m;
    facp=1.0/m;
    for (j=1;j<=mm;j++) sumw += SQR(WINDOW(j,facm,facp));
                                            /* Accumulate the squared sum of the weights. */
    for (j=1;j<=m;j++) p[j]=0.0;            /* Initialize the spectrum to zero. */
    if (ovrlap)                             /* Initialize the "save" half-buffer. */
        for (j=1;j<=m;j++) fscanf(fp,"%f",&w2[j]);
    for (kk=1;kk<=k;kk++) {
        /* Loop over data set segments in groups of two. */
        for (joff = -1;joff<=0;joff++) {    /* Get two complete segments into workspace. */
            if (ovrlap) {
                for (j=1;j<=m;j++) w1[joff+j+j]=w2[j];
                for (j=1;j<=m;j++) fscanf(fp,"%f",&w2[j]);
                joffn=joff+mm;
                for (j=1;j<=m;j++) w1[joffn+j+j]=w2[j];
            } else {
                for (j=joff+2;j<=m4;j+=2)
                    fscanf(fp,"%f",&w1[j]);
            }
        }
        for (j=1;j<=mm;j++) {               /* Apply the window to the data. */
            j2=j+j;
            w=WINDOW(j,facm,facp);
            w1[j2] *= w;
            w1[j2-1] *= w;
        }
        four1(w1,mm,1);                     /* Fourier transform the windowed data. */
        p[1] += (SQR(w1[1])+SQR(w1[2]));    /* Sum results into previous segments. */
        for (j=2;j<=m;j++) {
            j2=j+j;
            p[j] += (SQR(w1[j2])+SQR(w1[j2-1])
                +SQR(w1[m44-j2])+SQR(w1[m43-j2]));
        }
        den += sumw;
    }
    den *= m4;                              /* Correct normalization. */
    for (j=1;j<=m;j++) p[j] /= den;         /* Normalize the output. */
    free_vector(w2,1,m);
    free_vector(w1,1,m4);
}

13.5 Digital Filtering in the Time Domain

Suppose that you have a signal that you want to filter digitally. For example, perhaps 
you want to apply high-pass or low-pass filtering, to eliminate noise at low or high frequencies 
respectively; or perhaps the interesting part of your signal lies only in a certain frequency 
band, so that you need a bandpass filter. Or, if your measurements are contaminated by 60 
Hz power-line interference, you may need a notch filter to remove only a narrow band around 
that frequency. This section speaks particularly about the case in which you have chosen 
to do such filtering in the time domain. 

Before continuing, we hope you will reconsider this choice. Remember how convenient 
it is to filter in the Fourier domain. You just take your whole data record, FFT it, multiply 
the FFT output by a filter function H(f), and then do an inverse FFT to get back a filtered 
data set in time domain. Here is some additional background on the Fourier technique that 
you will want to take into account. 

• Remember that you must define your filter function H(f) for both positive and
negative frequencies, and that the magnitude of the frequency extremes is always
the Nyquist frequency 1/(2Δ), where Δ is the sampling interval. The magnitude
of the smallest nonzero frequencies in the FFT is ±1/(NΔ), where N is the
number of (complex) points in the FFT. The positive and negative frequencies to
which this filter is applied are arranged in wrap-around order.

• If the measured data are real, and you want the filtered output also to be real, then
your arbitrary filter function should obey H(−f) = H(f)*. You can arrange this
most easily by picking an H that is real and even in f.

• If your chosen H(f) has sharp vertical edges in it, then the impulse response of
your filter (the output arising from a short impulse as input) will have damped
“ringing” at frequencies corresponding to these edges. There is nothing wrong
with this, but if you don’t like it, then pick a smoother H(f). To get a first-hand
look at the impulse response of your filter, just take the inverse FFT of your H(f).
If you smooth all edges of the filter function over some number k of points, then
the impulse response function of your filter will have a span on the order of a
fraction 1/k of the whole data record.

• If your data set is too long to FFT all at once, then break it up into segments of 
any convenient size, as long as they are much longer than the impulse response 
function of the filter. Use zero-padding, if necessary. 

• You should probably remove any trend from the data, by subtracting from it a 
straight line through the first and last points (i.e., make the first and last points equal 
to zero). If you are segmenting the data, then you can pick overlapping segments 
and use only the middle section of each, comfortably distant from edge effects. 

• A digital filter is said to be causal or physically realizable if its output for a
particular time-step depends only on inputs at that particular time-step or earlier.
It is said to be acausal if its output can depend on both earlier and later inputs.
Filtering in the Fourier domain is, in general, acausal, since the data are processed 
“in a batch,” without regard to time ordering. Don’t let this bother you! Acausal 
filters can generally give superior performance (e.g., less dispersion of phases, 
sharper edges, less asymmetric impulse response functions). People use causal 
filters not because they are better, but because some situations just don’t allow 
access to out-of-time-order data. Time domain filters can, in principle, be either 
causal or acausal, but they are most often used in applications where physical 
realizability is a constraint. For this reason we will restrict ourselves to the causal 
case in what follows. 

If you are still favoring time-domain filtering after all we have said, it is probably because 
you have a real-time application, for which you must process a continuous data stream and 
wish to output filtered values at the same rate as you receive raw data. Otherwise, it may 
be that the quantity of data to be processed is so large that you can afford only a very small 
number of floating operations on each data point and cannot afford even a modest-sized FFT 
(with a number of floating operations per data point several times the logarithm of the number 
of points in the data set or segment). 

Linear Filters 

The most general linear filter takes a sequence $x_k$ of input points and produces a
sequence $y_n$ of output points by the formula

$$y_n = \sum_{k=0}^{M} c_k\, x_{n-k} + \sum_{j=1}^{N} d_j\, y_{n-j} \tag{13.5.1}$$

Here the M + 1 coefficients $c_k$ and the N coefficients $d_j$ are fixed and define the filter
response. The filter (13.5.1) produces each new output value from the current and M previous
input values, and from its own N previous output values. If N = 0, so that there is no
second sum in (13.5.1), then the filter is called nonrecursive or finite impulse response (FIR). If
N ≠ 0, then it is called recursive or infinite impulse response (IIR). (The term “IIR” connotes
only that such filters are capable of having infinitely long impulse responses, not that their
impulse response is necessarily long in a particular application. Typically the response of an
IIR filter will drop off exponentially at late times, rapidly becoming negligible.)
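
A direct transcription of the general linear filter (13.5.1) takes only a few lines. The sketch below makes assumptions that are ours: zero-based arrays, inputs and outputs before the start of the record treated as zero, and the name `linear_filter`. The array d is indexed from 1 so that d[j] matches $d_j$; d[0] is unused.

```c
/* General linear filter (13.5.1):
     y[n] = sum_{k=0}^{m} c[k]*x[n-k] + sum_{j=1}^{nd} d[j]*y[n-j]
   c[0..m]  : the M+1 nonrecursive coefficients
   d[1..nd] : the N recursive coefficients (d[0] is ignored)
   x, y     : input and output records of length len
   Samples before the start of the record are taken to be zero. */
void linear_filter(const double *c, int m, const double *d, int nd,
                   const double *x, double *y, int len)
{
    int n, k, j;

    for (n = 0; n < len; n++) {
        double acc = 0.0;
        for (k = 0; k <= m && k <= n; k++)
            acc += c[k] * x[n - k];
        for (j = 1; j <= nd && j <= n; j++)
            acc += d[j] * y[n - j];
        y[n] = acc;
    }
}
```

Calling it with nd = 0 gives an FIR filter. With the single recursive coefficient d[1] = 0.5 and c[0] = 1, a unit impulse produces the output 1, 0.5, 0.25, ..., an exponentially decaying response that never becomes exactly zero.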

The relation between the $c_k$’s and $d_j$’s and the filter response function H(f) is

$$H(f) = \frac{\displaystyle\sum_{k=0}^{M} c_k\, e^{-2\pi i k (f\Delta)}}{\displaystyle 1 - \sum_{j=1}^{N} d_j\, e^{-2\pi i j (f\Delta)}} \tag{13.5.2}$$

where Δ is, as usual, the sampling interval. The Nyquist interval corresponds to fΔ between
−1/2 and 1/2. For FIR filters the denominator of (13.5.2) is just unity.

Equation (13.5.2) tells how to determine H(f) from the c’s and d’s. To design a filter,
though, we need a way of doing the inverse, getting a suitable set of c’s and d’s (as small
a set as possible, to minimize the computational burden) from a desired H(f). Entire
books are devoted to this issue. Like many other “inverse problems,” it has no all-purpose
solution. One clearly has to make compromises, since H(f) is a full continuous function,
while the short list of c’s and d’s represents only a few adjustable parameters. The subject of 
digital filter design concerns itself with the various ways of making these compromises. We 
cannot hope to give any sort of complete treatment of the subject. We can, however, sketch 
a couple of basic techniques to get you started. For further details, you will have to consult 
some specialized books (see references). 

FIR (Nonrecursive) Filters 

When the denominator in (13.5.2) is unity, the right-hand side is just a discrete Fourier
transform. The transform is easily invertible, giving the desired small number of $c_k$ coefficients
in terms of the same small number of values of $H(f_i)$ at some discrete frequencies $f_i$. This
fact, however, is not very useful. The reason is that, for values of $c_k$ computed in this way,
H(f) will tend to oscillate wildly in between the discrete frequencies where it is pinned
down to specific values.

A better strategy, and one which is the basis of several formal methods in the literature,
is this: Start by pretending that you are willing to have a relatively large number of filter
coefficients, that is, a relatively large value of M. Then H(f) can be fixed to desired values
on a relatively fine mesh, and the M coefficients $c_k$, k = 0, ..., M − 1 can be found by
an FFT. Next, truncate (set to zero) most of the $c_k$’s, leaving nonzero only the first, say,
K, $(c_0, c_1, \ldots, c_{K-1})$ and last K − 1, $(c_{M-K+1}, \ldots, c_{M-1})$. The last few $c_k$’s are filter
coefficients at negative lag, because of the wrap-around property of the FFT. But we don’t
want coefficients at negative lag. Therefore we cyclically shift the array of $c_k$’s, to bring
everything to positive lag. (This corresponds to introducing a time-delay into the filter.) Do
this by copying the $c_k$’s into a new array of length M in the following order:

$$(c_{M-K+1}, \ldots, c_{M-1},\ c_0, c_1, \ldots, c_{K-1},\ 0, 0, \ldots, 0) \tag{13.5.3}$$



To see if your truncation is acceptable, take the FFT of the array (13.5.3), giving an
approximation to your original H(f). You will generally want to compare the modulus
|H(f)| to your original function, since the time-delay will have introduced complex phases
into the filter response.
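
The truncate-and-shift step that produces the array (13.5.3) is mostly index bookkeeping, so a short sketch may be useful. It assumes zero-based arrays and 2K − 1 ≤ M; the function name is illustrative and not part of the book's library.

```c
/* Build the array (13.5.3) from M coefficients c[0..m-1] given in FFT
   wrap-around order. The K - 1 coefficients at negative lag come first,
   followed by the K coefficients at lags 0, ..., K-1, then zeros.
   Requires 2*k - 1 <= m. The result is written to out[0..m-1]. */
void truncate_and_shift(const double *c, int m, int k, double *out)
{
    int i, pos = 0;

    for (i = m - k + 1; i < m; i++)     /* c_{M-K+1}, ..., c_{M-1} */
        out[pos++] = c[i];
    for (i = 0; i < k; i++)             /* c_0, ..., c_{K-1} */
        out[pos++] = c[i];
    while (pos < m)                     /* remaining entries are zero */
        out[pos++] = 0.0;
}
```

The first 2K − 1 entries of `out` are then the causal FIR coefficients, delayed by K − 1 samples relative to the original zero-lag coefficient.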

If the new filter function is acceptable, then you are done and have a set of 2K − 1
filter coefficients. If it is not acceptable, then you can either (i) increase K and try again,
or (ii) do something fancier to improve the acceptability for the same K. An example of
something fancier is to modify the magnitudes (but not the phases) of the unacceptable H(f)
to bring it more in line with your ideal, and then to FFT to get new $c_k$’s. Once again set
to zero all but the first 2K − 1 values of these (no need to cyclically shift since you have
preserved the time-delaying phases), then inverse transform to get a new H(f), which will
often be more acceptable. You can iterate this procedure. Note, however, that the procedure
will not converge if your requirements for acceptability are more stringent than your 2K − 1
coefficients can handle.

The key idea, in other words, is to iterate between the space of coefficients and the space 
of functions H(f), until a Fourier conjugate pair that satisfies the imposed constraints in both 
spaces is found. A more formal technique for this kind of iteration is the Remes Exchange 
Algorithm which produces the best Chebyshev approximation to a given desired frequency 
response with a fixed number of filter coefficients (cf. §5.13). 

IIR (Recursive) Filters

Recursive filters, whose output at a given time depends both on the current and previous 
inputs and on previous outputs, can generally have performance that is superior to nonrecursive 
filters with the same total number of coefficients (or same number of floating operations per 
input point). The reason is fairly clear by inspection of (13.5.2): A nonrecursive filter has a 
frequency response that is a polynomial in the variable 1/z, where

$$z \equiv e^{2\pi i (f\Delta)} \tag{13.5.4}$$

By contrast, a recursive filter’s frequency response is a rational function in 1/z. The class of
rational functions is especially good at fitting functions with sharp edges or narrow features, 
and most desired filter functions are in this category. 

Nonrecursive filters are always stable. If you turn off the sequence of incoming $x_i$’s,
then after no more than M steps the sequence of $y_i$’s produced by (13.5.1) will also turn off.
Recursive filters, feeding as they do on their own output, are not necessarily stable. If the 
coefficients dj are badly chosen, a recursive filter can have exponentially growing, so-called 
homogeneous, modes, which become huge even after the input sequence has been turned off. 
This is not good. The problem of designing recursive filters, therefore, is not just an inverse 
problem; it is an inverse problem with an additional stability constraint. 

How do you tell if the filter (13.5.1) is stable for a given set of $c_k$ and $d_j$ coefficients?
Stability depends only on the $d_j$’s. The filter is stable if and only if all N complex roots
of the characteristic polynomial equation

$$z^N - \sum_{j=1}^{N} d_j\, z^{N-j} = 0 \tag{13.5.5}$$

are inside the unit circle, i.e., satisfy

$$|z| < 1 \tag{13.5.6}$$

The various methods for constructing stable recursive filters again form a subject area 
for which you will need more specialized books. One very useful technique, however, is the 
bilinear transformation method. For this topic we define a new variable w that reparametrizes
the frequency f,

$$w \equiv \tan[\pi(f\Delta)] = i\left(\frac{1 - e^{2\pi i(f\Delta)}}{1 + e^{2\pi i(f\Delta)}}\right) = i\left(\frac{1 - z}{1 + z}\right) \tag{13.5.7}$$

Don’t be fooled by the i’s in (13.5.7). This equation maps real frequencies f into real values of
w. In fact, it maps the Nyquist interval $-\frac{1}{2} < f\Delta < \frac{1}{2}$ onto the real w axis $-\infty < w < +\infty$.
The inverse equation to (13.5.7) is

$$z = e^{2\pi i(f\Delta)} = \frac{1 + iw}{1 - iw} \tag{13.5.8}$$

In reparametrizing f, w also reparametrizes z, of course. Therefore, the condition for
stability (13.5.5)–(13.5.6) can be rephrased in terms of w: If the filter response H(f) is
written as a function of w, then the filter is stable if and only if the poles of the filter function 
(zeros of its denominator) are all in the upper half complex plane, 

Im(w) > 0 (13.5.9) 

The idea of the bilinear transformation method is that instead of specifying your desired
H(f), you specify only its desired modulus square, $|H(f)|^2 = H(f)H(f)^* = H(f)H(-f)$.
Pick this to be approximated by some rational function in $w^2$. Then find all the poles of this
function in the w complex plane. Every pole in the lower half-plane will have a corresponding
pole in the upper half-plane, by symmetry. The idea is to form a product only of the factors
with good poles, ones in the upper half-plane. This product is your stably realizable H(f).
Now substitute equation (13.5.7) to write the function as a rational function in z, and compare
with equation (13.5.2) to read off the c’s and d’s.

The procedure becomes clearer when we go through an example. Suppose we want to
design a simple bandpass filter, whose lower cutoff frequency corresponds to a value w = a, 
and whose upper cutoff frequency corresponds to a value w = b, with a and b both positive 
numbers. A simple rational function that accomplishes this is

$$|H(f)|^2 = \left(\frac{w^2}{w^2 + a^2}\right)\left(\frac{b^2}{w^2 + b^2}\right) \tag{13.5.10}$$


This function does not have a very sharp cutoff, but it is illustrative of the more general 
case. To obtain sharper edges, one could take the function (13.5.10) to some positive integer 
power, or, equivalently, run the data sequentially through some number of copies of the filter 
that we will obtain from (13.5.10). 

The poles of (13.5.10) are evidently at w = ±ia and w = ±ib. Therefore the stably
realizable H(f) is

$$H(f) = \left(\frac{w}{w - ia}\right)\left(\frac{ib}{w - ib}\right) \tag{13.5.11}$$


We put the i in the numerator of the second factor in order to end up with real-valued 
coefficients. If we multiply out all the denominators, (13.5.11) can be rewritten in the form 


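$$H(f) = \frac{-\dfrac{b}{(1+a)(1+b)} + \dfrac{b}{(1+a)(1+b)}\, z^{-2}}{1 - \dfrac{(1+a)(1-b) + (1-a)(1+b)}{(1+a)(1+b)}\, z^{-1} + \dfrac{(1-a)(1-b)}{(1+a)(1+b)}\, z^{-2}} \tag{13.5.12}$$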

from which one reads off the filter coefficients for equation (13.5.1), 


$$\begin{aligned}
c_0 &= -\frac{b}{(1+a)(1+b)} \\
c_1 &= 0 \\
c_2 &= \frac{b}{(1+a)(1+b)} \\
d_1 &= \frac{(1+a)(1-b) + (1-a)(1+b)}{(1+a)(1+b)} \\
d_2 &= -\frac{(1-a)(1-b)}{(1+a)(1+b)}
\end{aligned} \tag{13.5.13}$$


This completes the design of the bandpass filter. 

Sometimes you can figure out how to construct directly a rational function in w for H(f), 
rather than having to start with its modulus square. The function that you construct has to have 
its poles only in the upper half-plane, for stability. It should also have the property of going into 
its own complex conjugate if you substitute −w for w, so that the filter coefficients will be real.

For example, here is a function for a notch filter, designed to remove only a narrow
frequency band around some fiducial frequency $w = w_0$, where $w_0$ is a positive number,

$$H(f) = \left(\frac{w - w_0}{w - w_0 - i\epsilon w_0}\right)\left(\frac{w + w_0}{w + w_0 - i\epsilon w_0}\right) = \frac{w^2 - w_0^2}{(w - i\epsilon w_0)^2 - w_0^2} \tag{13.5.14}$$



Figure 13.5.1. (a) A “chirp,” or signal whose frequency increases continuously with time. (b) Same
signal after it has passed through the notch filter (13.5.15). The parameter ε is here 0.2.

In (13.5.14) the parameter ε is a small positive number that is the desired width of the notch, as a
fraction of $w_0$. Going through the arithmetic of substituting z for w gives the filter coefficients


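$$\begin{aligned}
c_0 &= \frac{1 + w_0^2}{(1 + \epsilon w_0)^2 + w_0^2} \\
c_1 &= -2\,\frac{1 - w_0^2}{(1 + \epsilon w_0)^2 + w_0^2} \\
c_2 &= \frac{1 + w_0^2}{(1 + \epsilon w_0)^2 + w_0^2} \\
d_1 &= 2\,\frac{1 - \epsilon^2 w_0^2 - w_0^2}{(1 + \epsilon w_0)^2 + w_0^2} \\
d_2 &= -\frac{(1 - \epsilon w_0)^2 + w_0^2}{(1 + \epsilon w_0)^2 + w_0^2}
\end{aligned} \tag{13.5.15}$$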


Figure 13.5.1 shows the results of using a filter of the form (13.5.15) on a “chirp” input 
signal, one that glides upwards in frequency, crossing the notch frequency along the way. 

While the bilinear transformation may seem very general, its applications are limited 
by some features of the resulting filters. The method is good at getting the general shape 
of the desired filter, and good where “flatness” is a desired goal. However, the nonlinear 
mapping between w and f makes it difficult to design to a desired shape for a cutoff, and
may move cutoff frequencies (defined by a certain number of dB) from their desired places.
Consequently, practitioners of the art of digital filter design reserve the bilinear transformation
for specific situations, and arm themselves with a variety of other tricks. We suggest that
you do likewise, as your projects demand.

13.6 Linear Prediction and Linear Predictive Coding

We begin with a very general formulation that will allow us to make connections
to various special cases. Let $\{y'_\alpha\}$ be a set of measured values for some underlying
set of true values of a quantity y, denoted $\{y_\alpha\}$, related to these true values by
the addition of random noise,

$$y'_\alpha = y_\alpha + n_\alpha \tag{13.6.1}$$

(compare equation 13.3.2, with a somewhat different notation). Our use of a Greek 
subscript to index the members of the set is meant to indicate that the data points 
are not necessarily equally spaced along a line, or even ordered: they might be 
“random” points in three-dimensional space, for example. Now, suppose we want to 
construct the “best” estimate of the true value of some particular point y * as a linear 
combination of the known, noisy, values. Writing 

V* = X] d *<*y'c* + x * (13.6.2) 

we want to find coefficients d to that minimize, in some way, the discrepancy x *. 
The coefficients have a “star” subscript to indicate that they depend on the choice 
of point y*. Later, we might want to let y* be one of the existing y a ’s. In that case, 
our problem becomes one of optimal filtering or estimation, closely related to the 
discussion in §13.3. On the other hand, we might want y* to be a completely new 
point. In that case, our problem will be one of linear prediction. 

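A natural way to minimize the discrepancy $x_*$ is in the statistical mean square
sense. If angle brackets denote statistical averages, then we seek $d_{*\alpha}$’s that minimize

$$
\begin{aligned}
\langle x_*^2 \rangle &= \left\langle \Big[ \sum_\alpha d_{*\alpha} (y_\alpha + n_\alpha) - y_* \Big]^2 \right\rangle \\
&= \sum_{\alpha\beta} \left( \langle y_\alpha y_\beta \rangle + \langle n_\alpha n_\beta \rangle \right) d_{*\alpha} d_{*\beta}
 - 2 \sum_\alpha \langle y_* y_\alpha \rangle\, d_{*\alpha} + \langle y_*^2 \rangle
\end{aligned}
\tag{13.6.3}
$$

Here we have used the fact that noise is uncorrelated with signal, e.g., $\langle n_\alpha y_\beta \rangle = 0$.
The quantities $\langle y_\alpha y_\beta \rangle$ and $\langle y_* y_\alpha \rangle$ describe the autocorrelation structure of the
underlying data. We have already seen an analogous expression, (13.2.2), for the
case of equally spaced data points on a line; we will meet correlation several times
again in its statistical sense in Chapters 14 and 15. The quantities $\langle n_\alpha n_\beta \rangle$ describe the
autocorrelation properties of the noise. Often, for point-to-point uncorrelated noise,
we have $\langle n_\alpha n_\beta \rangle = \langle n_\alpha^2 \rangle\, \delta_{\alpha\beta}$. It is convenient to think of the various correlation
quantities as comprising matrices and vectors,

$$ \phi_{\alpha\beta} \equiv \langle y_\alpha y_\beta \rangle \qquad \phi_{*\alpha} \equiv \langle y_* y_\alpha \rangle \qquad \eta_{\alpha\beta} \equiv \langle n_\alpha n_\beta \rangle \ \text{or}\ \langle n_\alpha^2 \rangle\, \delta_{\alpha\beta} \tag{13.6.4} $$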

Setting the derivative of equation (13.6.3) with respect to the $d_{*\alpha}$’s equal to zero,
one readily obtains the set of linear equations,

$$ \sum_\beta \left[ \phi_{\alpha\beta} + \eta_{\alpha\beta} \right] d_{*\beta} = \phi_{*\alpha} \tag{13.6.5} $$
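
The differentiation is short enough to write out. This is a sketch in the notation of (13.6.4); it assumes only that the matrices $\phi$ and $\eta$ are symmetric, which follows from their definitions as averages of products. The dummy indices $\mu$, $\nu$ are ours.

```latex
\frac{\partial \langle x_*^2 \rangle}{\partial d_{*\alpha}}
  = \frac{\partial}{\partial d_{*\alpha}}
    \Big[ \sum_{\mu\nu} (\phi_{\mu\nu} + \eta_{\mu\nu})\, d_{*\mu} d_{*\nu}
          - 2 \sum_{\mu} \phi_{*\mu}\, d_{*\mu} + \langle y_*^2 \rangle \Big]
  = 2 \sum_{\beta} (\phi_{\alpha\beta} + \eta_{\alpha\beta})\, d_{*\beta}
    - 2\, \phi_{*\alpha} = 0
```

The factor of 2 in the first term comes from the two places in the quadratic form where $d_{*\alpha}$ appears; dividing through by 2 gives the linear equations for the $d_{*\beta}$’s.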

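If we write the solution as a matrix inverse, then the estimation equation (13.6.2)
becomes, omitting the minimized discrepancy $x_*$,

$$ y_* \approx \sum_{\alpha\beta} \phi_{*\alpha} \left[ \phi_{\mu\nu} + \eta_{\mu\nu} \right]^{-1}_{\alpha\beta} y'_\beta \tag{13.6.6} $$

From equations (13.6.3) and (13.6.5) one can also calculate the expected mean square
value of the discrepancy at its minimum, denoted $\langle x_*^2 \rangle_0$,

$$ \langle x_*^2 \rangle_0 = \langle y_*^2 \rangle - \sum_\beta d_{*\beta}\, \phi_{*\beta} = \langle y_*^2 \rangle - \sum_{\alpha\beta} \phi_{*\alpha} \left[ \phi_{\mu\nu} + \eta_{\mu\nu} \right]^{-1}_{\alpha\beta} \phi_{*\beta} \tag{13.6.7} $$

A final general result tells how much the mean square discrepancy $\langle x_*^2 \rangle$ is
increased if we use the estimation equation (13.6.2) not with the best values $d_{*\beta}$, but
with some other values $\hat{d}_{*\beta}$. The above equations then imply

$$ \langle x_*^2 \rangle = \langle x_*^2 \rangle_0 + \sum_{\alpha\beta} (\hat{d}_{*\alpha} - d_{*\alpha}) \left[ \phi_{\alpha\beta} + \eta_{\alpha\beta} \right] (\hat{d}_{*\beta} - d_{*\beta}) \tag{13.6.8} $$

Since the second term is a pure quadratic form, we see that the increase in the
discrepancy is only second order in any error made in estimating the $d_{*\beta}$’s.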

Connection to Optimal Filtering 

If we change “star” to a Greek index, say $\gamma$, then the above formulas describe
optimal filtering, generalizing the discussion of §13.3. One sees, for example, that
if the noise amplitudes $n_\alpha$ go to zero, so likewise do the noise autocorrelations
$\eta_{\alpha\beta}$, and, canceling a matrix times its inverse, equation (13.6.6) simply becomes
$y_\gamma = y'_\gamma$. Another special case occurs if the matrices $\phi_{\alpha\beta}$ and $\eta_{\alpha\beta}$ are diagonal.
In that case, equation (13.6.6) becomes

$$ y_\gamma = \frac{\phi_{\gamma\gamma}}{\phi_{\gamma\gamma} + \eta_{\gamma\gamma}}\, y'_\gamma \tag{13.6.9} $$

which is readily recognizable as equation (13.3.6) with $S^2 \rightarrow \phi_{\gamma\gamma}$, $N^2 \rightarrow \eta_{\gamma\gamma}$. What
is going on is this: For the case of equally spaced data points, and in the Fourier
domain, autocorrelations become simply squares of Fourier amplitudes (Wiener-Khinchin
theorem, equation 12.0.12), and the optimal filter can be constructed
algebraically, as equation (13.6.9), without inverting any matrix.
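
The diagonal case is simple enough to state as a one-line function. The sketch below is only an illustration of (13.6.9); the name optimal_diagonal and its argument names are ours.

```c
/* Equation (13.6.9): optimal estimate of a single point when the signal
   and noise correlation matrices are diagonal.
   phi    = signal power <y_gamma^2>
   eta    = noise power  <n_gamma^2>
   y_meas = the measured (noisy) value y'_gamma */
double optimal_diagonal(double phi, double eta, double y_meas)
{
    return phi / (phi + eta) * y_meas;
}
```

With signal power 3 and noise power 1, a measured value of 4 is shrunk to 3; with no noise the measurement is returned unchanged.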

More generally, in the time domain, or any other domain, an optimal filter (one
that minimizes the square of the discrepancy from the underlying true value in the
presence of measurement noise) can be constructed by estimating the autocorrelation
matrices $\phi_{\alpha\beta}$ and $\eta_{\alpha\beta}$, and applying equation (13.6.6) with $* \rightarrow \gamma$. (Equation
13.6.8 is in fact the basis for §13.3’s statement that even crude optimal filtering
can be quite effective.)

Linear Prediction 

Classical linear prediction specializes to the case where the data points $y_\beta$
are equally spaced along a line, $y_i$, $i = 1, 2, \ldots, N$, and we want to use $M$
consecutive values of $y_i$ to predict an $M+1$st. Stationarity is assumed. That is, the
autocorrelation $\langle y_j y_k \rangle$ is assumed to depend only on the difference $|j - k|$, and not
on $j$ or $k$ individually, so that the autocorrelation $\phi$ has only a single index,

$$ \phi_j \equiv \langle y_i y_{i+j} \rangle \approx \frac{1}{N-j} \sum_{i=1}^{N-j} y_i\, y_{i+j} \tag{13.6.10} $$

Here, the approximate equality shows one way to use the actual data set values to
estimate the autocorrelation components. (In fact, there is a better way to make these
estimates; see below.) In the situation described, the estimation equation (13.6.2) is

$$ y_n = \sum_{j=1}^{M} d_j\, y_{n-j} + x_n \tag{13.6.11} $$

(compare equation 13.5.1) and equation (13.6.5) becomes the set of $M$ equations for
the $M$ unknown $d_j$’s, now called the linear prediction (LP) coefficients,

$$ \sum_{j=1}^{M} \phi_{|j-k|}\, d_j = \phi_k \qquad (k = 1, \ldots, M) \tag{13.6.12} $$

Notice that while noise is not explicitly included in the equations, it is properly
accounted for, if it is point-to-point uncorrelated: $\phi_0$, as estimated by equation
(13.6.10) using measured values $y'_i$, actually estimates the diagonal part of $\phi_{\alpha\alpha} + \eta_{\alpha\alpha}$,
above. The mean square discrepancy $\langle x_n^2 \rangle$ is estimated by equation (13.6.7) as

$$ \langle x_n^2 \rangle = \phi_0 - \phi_1 d_1 - \phi_2 d_2 - \cdots - \phi_M d_M \tag{13.6.13} $$
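
To make equations (13.6.10), (13.6.12), and (13.6.13) concrete, here is a small sketch that estimates the autocorrelation directly and solves the $M \times M$ system by plain Gaussian elimination. It is meant for illustration at small $M$ only. The names lp_autocorr, lp_solve, and the cap LP_MAXM are ours; the data array is zero-based, while d[1..m] follows the text's numbering with d[0] unused.

```c
#include <math.h>

#define LP_MAXM 32   /* hypothetical cap on the order M for this sketch */

/* Equation (13.6.10): phi[j] = (1/(N-j)) * sum_i y[i]*y[i+j], j = 0..m. */
void lp_autocorr(const double y[], int n, int m, double phi[])
{
    int i, j;
    for (j = 0; j <= m; j++) {
        double s = 0.0;
        for (i = 0; i + j < n; i++) s += y[i] * y[i + j];
        phi[j] = s / (n - j);
    }
}

/* Solve equation (13.6.12), sum_j phi[|j-k|] d[j] = phi[k], for d[1..m]
   by Gaussian elimination with partial pivoting, and return the mean
   square discrepancy of equation (13.6.13). Returns -1.0 if m is out of
   range or the matrix is numerically singular. */
double lp_solve(const double phi[], int m, double d[])
{
    double a[LP_MAXM][LP_MAXM + 1], xms;
    int i, j, k, p;

    if (m < 1 || m > LP_MAXM) return -1.0;
    for (k = 0; k < m; k++) {            /* augmented matrix [Phi | rhs] */
        for (j = 0; j < m; j++) a[k][j] = phi[j > k ? j - k : k - j];
        a[k][m] = phi[k + 1];
    }
    for (i = 0; i < m; i++) {            /* forward elimination */
        p = i;
        for (k = i + 1; k < m; k++)
            if (fabs(a[k][i]) > fabs(a[p][i])) p = k;
        if (fabs(a[p][i]) < 1e-300) return -1.0;
        if (p != i)
            for (j = i; j <= m; j++) {
                double t = a[i][j]; a[i][j] = a[p][j]; a[p][j] = t;
            }
        for (k = i + 1; k < m; k++) {
            double f = a[k][i] / a[i][i];
            for (j = i; j <= m; j++) a[k][j] -= f * a[i][j];
        }
    }
    for (i = m - 1; i >= 0; i--) {       /* back substitution */
        double s = a[i][m];
        for (j = i + 1; j < m; j++) s -= a[i][j] * d[j + 1];
        d[i + 1] = s / a[i][i];
    }
    xms = phi[0];                        /* equation (13.6.13) */
    for (j = 1; j <= m; j++) xms -= phi[j] * d[j];
    return xms;
}
```

As a check, the autocorrelation $\phi_j = (1/2)^j$ with $M = 2$ gives $d_1 = 1/2$, $d_2 = 0$, and a mean square discrepancy of $3/4$. The matrix in (13.6.12) is Toeplitz, so faster specialized solvers exist; elimination is used here only because it is short.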

To use linear prediction, we first compute the $d_j$’s, using equations (13.6.10)
and (13.6.12). We then calculate equation (13.6.13) or, more concretely, apply
(13.6.11) to the known record to get an idea of how large are the discrepancies $x_i$.
If the discrepancies are small, then we can continue applying (13.6.11) right on into
the future, imagining the unknown “future” discrepancies $x_i$ to be zero. In this
application, (13.6.11) is a kind of extrapolation formula. In many situations, this
extrapolation turns out to be vastly more powerful than any kind of simple polynomial
extrapolation. (By the way, you should not confuse the terms “linear prediction” and
“linear extrapolation”; the general functional form used by linear prediction is much
more complex than a straight line, or even a low-order polynomial!)

However, to achieve its full usefulness, linear prediction must be constrained in 
one additional respect: One must take additional measures to guarantee its stability. 
Equation (13.6.11) is a special case of the general linear filter (13.5.1). The condition 
that (13.6.11) be stable as a linear predictor is precisely that given in equations 
(13.5.5) and (13.5.6), namely that the characteristic polynomial

$$ z^N - \sum_{j=1}^{N} d_j z^{N-j} = 0 \tag{13.6.14} $$

have all $N$ of its roots inside the unit circle,

$$ |z| \le 1 \tag{13.6.15} $$

There is no guarantee that the coefficients produced by equation (13.6.12) will have 
this property. If the data contain many oscillations without any particular trend 
towards increasing or decreasing amplitude, then the complex roots of (13.6.14) 
will generally all be rather close to the unit circle. The finite length of the data 
set will cause some of these roots to be inside the unit circle, others outside. In 
some applications, where the resulting instabilities are slowly growing and the linear 
prediction is not pushed too far, it is best to use the “unmassaged” LP coefficients 
that come directly out of (13.6.12). For example, one might be extrapolating to fill a 
short gap in a data set; then one might extrapolate both forwards across the gap and 
backwards from the data beyond the gap. If the two extrapolations agree tolerably 
well, then instability is not a problem. 

When instability is a problem, you have to “massage” the LP coefficients. You 
do this by (i) solving (numerically) equation (13.6.14) for its N complex roots; (ii) 
moving the roots to where you think they ought to be inside or on the unit circle; (iii) 
reconstituting the now-modified LP coefficients. You may think that step (ii) sounds 
a little vague. It is. There is no “best” procedure. If you think that your signal 
is truly a sum of undamped sine and cosine waves (perhaps with incommensurate 
periods), then you will want simply to move each root $z_i$ onto the unit circle,

$$ z_i \rightarrow z_i / |z_i| \tag{13.6.16} $$

In other circumstances it may seem appropriate to reflect a bad root across the
unit circle

$$ z_i \rightarrow 1/z_i^* \tag{13.6.17} $$

This alternative has the property that it preserves the amplitude of the output of
(13.6.11) when it is driven by a sinusoidal set of $x_i$’s. It assumes that (13.6.12)
has correctly identified the spectral width of a resonance, but only slipped up on
identifying its time sense so that signals that should be damped as time proceeds end
up growing in amplitude. The choice between (13.6.16) and (13.6.17) sometimes
might as well be based on voodoo. We prefer (13.6.17).

Also magical is the choice of M, the number of LP coefficients to use. You 
should choose M to be as small as works for you, that is, you should choose it by 
experimenting with your data. Try $M$ = 5, 10, 20, 40. If you need larger $M$’s than
this, be aware that the procedure of “massaging” all those complex roots is quite 
sensitive to roundoff error. Use double precision. 

Linear prediction is especially successful at extrapolating signals that are smooth 
and oscillatory, though not necessarily periodic. In such cases, linear prediction often 
extrapolates accurately through many cycles of the signal. By contrast, polynomial 
extrapolation in general becomes seriously inaccurate after at most a cycle or two. 
A prototypical example of a signal that can successfully be linearly predicted is the 
height of ocean tides, for which the fundamental 12-hour period is modulated in 
phase and amplitude over the course of the month and year, and for which local 
hydrodynamic effects may make even one cycle of the curve look rather different 
in shape from a sine wave. 

We already remarked that equation (13.6.10) is not necessarily the best way
to estimate the covariances $\phi_k$ from the data set. In fact, results obtained from
linear prediction are remarkably sensitive to exactly how the $\phi_k$’s are estimated.
One particularly good method is due to Burg [1], and involves a recursive procedure
for increasing the order $M$ by one unit at a time, at each stage re-estimating the
coefficients $d_j$, $j = 1, \ldots, M$ so as to minimize the residual in equation (13.6.13).
Although further discussion of the Burg method is beyond our scope here, the method
is implemented in the following routine [1,2] for estimating the LP coefficients $d_j$
of a data set.

#include <math.h>
#include "nrutil.h"

void memcof(float data[], int n, int m, float *xms, float d[])
/* Given a real vector of data[1..n], and given m, this routine returns m linear
   prediction coefficients as d[1..m], and returns the mean square discrepancy as xms. */
{
    int k,j,i;
    float p=0.0,*wk1,*wk2,*wkm;

    wk1=vector(1,n);
    wk2=vector(1,n);
    wkm=vector(1,m);
    for (j=1;j<=n;j++) p += SQR(data[j]);
    *xms=p/n;
    wk1[1]=data[1];
    wk2[n-1]=data[n];
    for (j=2;j<=n-1;j++) {
        wk1[j]=data[j];
        wk2[j-1]=data[j];
    }
    for (k=1;k<=m;k++) {
        float num=0.0,denom=0.0;
        for (j=1;j<=(n-k);j++) {
            num += wk1[j]*wk2[j];
            denom += SQR(wk1[j])+SQR(wk2[j]);
        }
        d[k]=2.0*num/denom;
        *xms *= (1.0-SQR(d[k]));
        for (i=1;i<=(k-1);i++)
            d[i]=wkm[i]-d[k]*wkm[k-i];
        /* The algorithm is recursive, building up the answer for larger and larger
           values of m until the desired value is reached. At this point in the
           algorithm, one could return the vector d and scalar xms for a set of LP
           coefficients with k (rather than m) terms. */
        if (k == m) {
            free_vector(wkm,1,m);
            free_vector(wk2,1,n);
            free_vector(wk1,1,n);
            return;
        }
        for (i=1;i<=k;i++) wkm[i]=d[i];
        for (j=1;j<=(n-k-1);j++) {
            wk1[j] -= wkm[k]*wk2[j];
            wk2[j]=wk2[j+1]-wkm[k]*wk1[j+1];
        }
    }
    nrerror("never get here in memcof.");
}


Here are procedures for rendering the LP coefficients stable (if you choose to 
do so), and for extrapolating a data set by linear prediction, using the original or 
massaged LP coefficients. The routine zroots (§9.5) is used to find all complex 
roots of a polynomial. 


#include <math.h>
#include "complex.h"
#define NMAX 100                /* Largest expected value of m. */
#define ZERO Complex(0.0,0.0)
#define ONE Complex(1.0,0.0)

void fixrts(float d[], int m)
/* Given the LP coefficients d[1..m], this routine finds all roots of the characteristic
   polynomial (13.6.14), reflects any roots that are outside the unit circle back inside,
   and then returns a modified set of coefficients d[1..m]. */
{
    void zroots(fcomplex a[], int m, fcomplex roots[], int polish);
    int i,j,polish;
    fcomplex a[NMAX],roots[NMAX];

    a[m]=ONE;
    for (j=m-1;j>=0;j--)        /* Set up complex coefficients for polynomial root finder. */
        a[j]=Complex(-d[m-j],0.0);
    polish=1;
    zroots(a,m,roots,polish);   /* Find all the roots. */
    for (j=1;j<=m;j++)          /* Look for a... */
        if (Cabs(roots[j]) > 1.0)   /* root outside the unit circle, */
            roots[j]=Cdiv(ONE,Conjg(roots[j]));   /* and reflect it back inside. */
    a[0]=Csub(ZERO,roots[1]);   /* Now reconstruct the polynomial coefficients, */
    a[1]=ONE;
    for (j=2;j<=m;j++) {        /* by looping over the roots */
        a[j]=ONE;
        for (i=j;i>=2;i--)      /* and synthetically multiplying. */
            a[i-1]=Csub(a[i-2],Cmul(roots[j],a[i-1]));
        a[0]=Csub(ZERO,Cmul(roots[j],a[0]));
    }
    for (j=0;j<=m-1;j++)        /* The polynomial coefficients are guaranteed to be real, */
        d[m-j] = -a[j].r;       /* so we need only return the real part as new LP coefficients. */
}

#include "nrutil.h"

void predic(float data[], int ndata, float d[], int m, float future[],
    int nfut)
/* Given data[1..ndata], and given the data's LP coefficients d[1..m], this routine
   applies equation (13.6.11) to predict the next nfut data points, which it returns in
   the array future[1..nfut]. Note that the routine references only the last m values of
   data, as initial values for the prediction. */
{
    int k,j;
    float sum,discrp,*reg;

    reg=vector(1,m);
    for (j=1;j<=m;j++) reg[j]=data[ndata+1-j];
    for (j=1;j<=nfut;j++) {
        discrp=0.0;
        /* This is where you would put in a known discrepancy if you were reconstructing
           a function by linear predictive coding rather than extrapolating a function by
           linear prediction. See text. */
        sum=discrp;
        for (k=1;k<=m;k++) sum += d[k]*reg[k];
        /* If you want to implement circular arrays, you can avoid this shifting of
           coefficients. */
        for (k=m;k>=2;k--) reg[k]=reg[k-1];
        future[j]=reg[1]=sum;
    }
    free_vector(reg,1,m);
}

Removing the Bias in Linear Prediction 

You might expect that the sum of the $d_j$’s in equation (13.6.11) (or, more
generally, in equation 13.6.2) should be 1, so that (e.g.) adding a constant to all the
data points $y_i$ yields a prediction that is increased by the same constant. However,
the $d_j$’s do not sum to 1 but, in general, to a value slightly less than one. This fact
reveals a subtle point, that the estimator of classical linear prediction is not unbiased,
even though it does minimize the mean square discrepancy. At any place where the
measured autocorrelation does not imply a better estimate, the equations of linear
prediction tend to predict a value that tends towards zero.

Sometimes, that is just what you want. If the process that generates the $y_i$’s
in fact has zero mean, then zero is the best guess absent other information. At
other times, however, this behavior is unwarranted. If you have data that show 
only small variations around a positive value, you don’t want linear predictions 
that droop towards zero. 

Often it is a workable approximation to subtract the mean off your data set, 
perform the linear prediction, and then add the mean back. This procedure contains 
the germ of the correct solution; but the simple arithmetic mean is not quite the 
correct constant to subtract. In fact, an unbiased estimator is obtained by subtracting 
from every data point an autocorrelation-weighted mean defined by [3,4] 

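$$ \bar{y} \equiv \frac{\sum_{\alpha\beta} \left[ \phi_{\mu\nu} + \eta_{\mu\nu} \right]^{-1}_{\alpha\beta}\, y_\beta}{\sum_{\alpha\beta} \left[ \phi_{\mu\nu} + \eta_{\mu\nu} \right]^{-1}_{\alpha\beta}} \tag{13.6.18} $$

With this subtraction, the sum of the LP coefficients should be unity, up to roundoff
and differences in how the $\phi_k$’s are estimated.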
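
The simpler "workable approximation" is easy to sketch. The function below subtracts the arithmetic mean, not the autocorrelation-weighted mean of (13.6.18), extrapolates with (13.6.11), and adds the mean back. The name lp_extrapolate_demeaned is ours; the data arrays are zero-based, d[1..m] follows the text's numbering with d[0] unused, and the coefficients are assumed to have been fitted to the mean-subtracted data.

```c
/* Extrapolate y[0..n-1] by nfut points with the arithmetic mean removed
   before prediction and restored afterwards. Requires n >= m. */
void lp_extrapolate_demeaned(const double y[], int n, const double d[],
                             int m, double future[], int nfut)
{
    double mean = 0.0;
    int i, j;

    for (i = 0; i < n; i++) mean += y[i];
    mean /= n;
    for (i = 0; i < nfut; i++) {
        double sum = 0.0;
        for (j = 1; j <= m; j++) {
            int idx = n + i - j;
            double v = (idx >= n) ? future[idx - n] : y[idx];
            sum += d[j] * (v - mean);    /* predict the deviation from the mean */
        }
        future[i] = sum + mean;          /* restore the mean */
    }
}
```

For a constant record equal to 5 and a single coefficient $d_1 = 0.5$, the uncorrected predictor would droop to 2.5, 1.25, and so on, while this version returns 5 at every step.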
Linear Predictive Coding (LPC)

A different, though related, method to which the formalism above can be 
applied is the “compression” of a sampled signal so that it can be stored more 
compactly. The original form should be exactly recoverable from the compressed 
version. Obviously, compression can be accomplished only if there is redundancy 
in the signal. Equation (13.6.11) describes one kind of redundancy: It says that 
the signal, except for a small discrepancy, is predictable from its previous values 
and from a small number of LP coefficients. Compression of a signal by the use of 
(13.6.11) is thus called linear predictive coding, or LPC. 

The basic idea of LPC (in its simplest form) is to record as a compressed file (i) 
the number of LP coefficients M, (ii) their M values, e.g., as obtained by memcof, 
(iii) the first M data points, and then (iv) for each subsequent data point only its 
residual discrepancy $x_i$ (equation 13.6.11). When you are creating the compressed
file, you find the residual by applying (13.6.11) to the previous $M$ points, subtracting
the sum from the actual value of the current point. When you are reconstructing the
original file, you add the residual back in, at the point indicated in the routine predic.

It may not be obvious why there is any compression at all in this scheme. After 
all, we are storing one value of residual per data point! Why not just store the original 
data point? The answer depends on the relative sizes of the numbers involved. The 
residual is obtained by subtracting two very nearly equal numbers (the data and the 
linear prediction). Therefore, the discrepancy typically has only a very small number 
of nonzero bits. These can be stored in a compressed file. How do you do it in a 
high-level language? Here is one way: Scale your data to have integer values, say 
between +1000000 and —1000000 (supposing that you need six significant figures). 
Modify equation (13.6.11) by enclosing the sum term in an “integer part of” operator.
The discrepancy will now, by definition, be an integer. Experiment with different 
values of M, to find LP coefficients that make the range of the discrepancy as small 
as you can. If you can get to within a range of ±127 (and in our experience this is not 
at all difficult) then you can write it to a file as a single byte. This is a compression 
factor of 4, compared to 4-byte integer or floating formats. 

Notice that the LP coefficients are computed using the quantized data, and that 
the discrepancy is also quantized, i.e., quantization is done both outside and inside 
the LPC loop. If you are careful in following this prescription, then, apart from the 
initial quantization of the data, you will not introduce even a single bit of roundoff 
error into the compression-reconstruction process: While the evaluation of the sum 
in (13.6.11) may have roundoff errors, the residual that you store is the value which, 
when added back to the sum, gives exactly the original (quantized) data value. Notice 
also that you do not need to massage the LP coefficients for stability; by adding the 
residual back in to each point, you never depart from the original data, so instabilities 
cannot grow. There is therefore no need for fixrts, above.

Look at §20.4 to learn about Huffman coding, which will further compress the 
residuals by taking advantage of the fact that smaller values of discrepancy will occur 
more often than larger values. A very primitive version of Huffman coding would 
be this: If most of the discrepancies are in the range ±127, but an occasional one is 
outside, then reserve the value 127 to mean “out of range,” and then record on the file 
(immediately following the 127) a full-word value of the out-of-range discrepancy. 
§20.4 explains how to do much better. 
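
The primitive escape scheme can be sketched as follows. The details are our assumptions, not a prescribed format: in-range residuals are taken to be -127..126, the byte value 127 is the escape, and an escaped value is stored as four bytes, least significant first. The names pack_residuals, unpack_residuals, and LPC_ESCAPE are ours.

```c
#include <stddef.h>
#include <stdint.h>

#define LPC_ESCAPE 127   /* byte value reserved to mean "out of range" */

/* Pack r[0..n-1] into buf and return the number of bytes written.
   buf must have room for 5*n bytes in the worst case. */
size_t pack_residuals(const int32_t r[], size_t n, unsigned char buf[])
{
    size_t i, len = 0;
    int b;
    for (i = 0; i < n; i++) {
        uint32_t u = (uint32_t)r[i];          /* two's complement bit pattern */
        if (r[i] >= -127 && r[i] <= 126) {
            buf[len++] = (unsigned char)(u & 0xFFu);
        } else {
            buf[len++] = LPC_ESCAPE;
            for (b = 0; b < 4; b++)
                buf[len++] = (unsigned char)((u >> (8 * b)) & 0xFFu);
        }
    }
    return len;
}

/* Unpack len bytes from buf into r and return the number of residuals
   recovered. Stops early if an escaped value is truncated. */
size_t unpack_residuals(const unsigned char buf[], size_t len, int32_t r[])
{
    size_t pos = 0, n = 0;
    int b;
    while (pos < len) {
        unsigned char c = buf[pos++];
        if (c != LPC_ESCAPE) {
            r[n++] = (c < 128) ? (int32_t)c : (int32_t)c - 256;
        } else {
            uint32_t u = 0;
            if (len - pos < 4) break;
            for (b = 0; b < 4; b++) u |= (uint32_t)buf[pos++] << (8 * b);
            r[n++] = (u <= 0x7FFFFFFFu) ? (int32_t)u : -(int32_t)(~u) - 1;
        }
    }
    return n;
}
```

A residual that fits costs one byte and one that does not costs five, so the scheme pays off only when escapes are rare.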
There are many variant procedures that all fall under the rubric of LPC.

• If the spectral character of the data is time-variable, then it is best not 
to use a single set of LP coefficients for the whole data set, but rather 
to partition the data into segments, computing and storing different LP 
coefficients for each segment. 

• If the data are really well characterized by their LP coefficients, and you 
can tolerate some small amount of error, then don’t bother storing all of the 
residuals. Just do linear prediction until you are outside of tolerances, then 
reinitialize (using M sequential stored residuals) and continue predicting. 

• In some applications, most notably speech synthesis, one cares only about
the spectral content of the reconstructed signal, not the relative phases.
In this case, one need not store any starting values at all, but only the
LP coefficients for each segment of the data. The output is reconstructed
by driving these coefficients with initial conditions consisting of all zeros
except for one nonzero spike. A speech synthesizer chip may have of
order 10 LP coefficients, which change perhaps 20 to 50 times per second.

• Some people believe that it is interesting to analyze a signal by LPC, even
when the residuals $x_i$ are not small. The $x_i$’s are then interpreted as the
underlying “input signal” which, when filtered through the all-poles filter 
defined by the LP coefficients (see §13.7), produces the observed “output 
signal.” LPC reveals simultaneously, it is said, the nature of the filter and 
the particular input that is driving it. We are skeptical of these applications; 
the literature, however, is full of extravagant claims. 

CITED REFERENCES AND FURTHER READING: 

Childers, D.G. (ed.) 1978, Modern Spectrum Analysis (New York: IEEE Press), especially the paper by J. Makhoul (reprinted from Proceedings of the IEEE, vol. 63, p. 561, 1975).

Burg, J.P. 1968, reprinted in Childers, 1978. [1]

Anderson, N. 1974, reprinted in Childers, 1978. [2]

Cressie, N. 1991, in Spatial Statistics and Digital Image Analysis (Washington: National Academy Press). [3]

Press, W.H., and Rybicki, G.B. 1992, Astrophysical Journal, vol. 398, pp. 169-176. [4]


13.7 Power Spectrum Estimation by the Maximum Entropy (All Poles) Method


The FFT is not the only way to estimate the power spectrum of a process, nor is it
necessarily the best way for all purposes. To see how one might devise another method,
let us enlarge our view for a moment, so that it includes not only real frequencies in the
Nyquist interval $-f_c < f < f_c$, but also the entire complex frequency plane. From that
vantage point, let us transform the complex $f$-plane to a new plane, called the z-transform
plane or $z$-plane, by the relation

$$ z \equiv e^{2\pi i f \Delta} \tag{13.7.1} $$

where $\Delta$ is, as usual, the sampling interval in the time domain. Notice that the Nyquist interval
on the real axis of the $f$-plane maps one-to-one onto the unit circle in the complex $z$-plane.
If we now compare (13.7.1) to equations (13.4.4) and (13.4.6), we see that the FFT
power spectrum estimate (13.4.5) for any real sampled function $c_k \equiv c(t_k)$ can be written,
except for normalization convention, as

$$ P(f) = \left| \sum_{k=-N/2}^{N/2-1} c_k z^k \right|^2 \tag{13.7.2} $$

Of course, (13.7.2) is not the true power spectrum of the underlying function $c(t)$, but only an
estimate. We can see in two related ways why the estimate is not likely to be exact. First, in the
time domain, the estimate is based on only a finite range of the function $c(t)$ which may, for all
we know, have continued from $t = -\infty$ to $\infty$. Second, in the $z$-plane of equation (13.7.2), the
finite Laurent series offers, in general, only an approximation to a general analytic function of
$z$. In fact, a formal expression for representing “true” power spectra (up to normalization) is

$$ P(f) = \left| \sum_{k=-\infty}^{\infty} c_k z^k \right|^2 \tag{13.7.3} $$

This is an infinite Laurent series which depends on an infinite number of values $c_k$. Equation
(13.7.2) is just one kind of analytic approximation to the analytic function of $z$ represented
by (13.7.3); the kind, in fact, that is implicit in the use of FFTs to estimate power spectra by
periodogram methods. It goes under several names, including direct method, all-zero model,
and moving average (MA) model. The term “all-zero” in particular refers to the fact that the
model spectrum can have zeros in the $z$-plane, but not poles.

If we look at the problem of approximating (13.7.3) more generally it seems clear that
we could do a better job with a rational function, one with a series of type (13.7.2) in both the
numerator and the denominator. Less obviously, it turns out that there are some advantages in
an approximation whose free parameters all lie in the denominator, namely,

$$ P(f) \approx \frac{1}{\left| \sum_{k=-M/2}^{M/2} b_k z^k \right|^2} = \frac{a_0}{\left| 1 + \sum_{k=1}^{M} a_k z^k \right|^2} \tag{13.7.4} $$

Here the second equality brings in a new set of coefficients $a_k$’s, which can be determined
from the $b_k$’s using the fact that $z$ lies on the unit circle. The $b_k$’s can be thought of as
being determined by the condition that power series expansion of (13.7.4) agree with the
first $M + 1$ terms of (13.7.3). In practice, as we shall see, one determines the $b_k$’s or
$a_k$’s by another method.

The differences between the approximations (13.7.2) and (13.7.4) are not just cosmetic.
They are approximations with very different character. Most notable is the fact that (13.7.4)
can have poles, corresponding to infinite power spectral density, on the unit $z$-circle, i.e., at
real frequencies in the Nyquist interval. Such poles can provide an accurate representation
for underlying power spectra that have sharp, discrete “lines” or delta-functions. By contrast,
(13.7.2) can have only zeros, not poles, at real frequencies in the Nyquist interval, and must
thus attempt to fit sharp spectral features with, essentially, a polynomial. The approximation
(13.7.4) goes under several names: all-poles model, maximum entropy method (MEM),
autoregressive model (AR). We need only find out how to compute the coefficients $a_0$ and the
$a_k$’s from a data set, so that we can actually use (13.7.4) to obtain spectral estimates.

A pleasant surprise is that we already know how! Look at equation (13.6.11) for linear
prediction. Compare it with linear filter equations (13.5.1) and (13.5.2), and you will see that,
viewed as a filter that takes input $x$’s into output $y$’s, linear prediction has a filter function

$$ \mathcal{H}(f) = \frac{1}{1 - \sum_{j=1}^{M} d_j z^{-j}} \tag{13.7.5} $$

Thus, the power spectrum of the $y$’s should be equal to the power spectrum of the $x$’s
multiplied by $|\mathcal{H}(f)|^2$. Now let us think about what the spectrum of the input $x$’s is, when
they are residual discrepancies from linear prediction. Although we will not prove it formally,
it is intuitively believable that the $x$’s are independently random and therefore have a flat
(white noise) spectrum. (Roughly speaking, any residual correlations left in the $x$’s would
have allowed a more accurate linear prediction, and would have been removed.) The overall
normalization of this flat spectrum is just the mean square amplitude of the $x$’s. But this is
exactly the quantity computed in equation (13.6.13) and returned by the routine memcof as
xms. Thus, the coefficients $a_0$ and $a_k$ in equation (13.7.4) are related to the LP coefficients
returned by memcof simply by

$$ a_0 = \text{xms} \qquad a_k = -d(k), \quad k = 1, \ldots, M \tag{13.7.6} $$

There is also another way to describe the relation between the $a_k$’s and the autocorrelation
components $\phi_j$. The Wiener-Khinchin theorem (12.0.12) says that the Fourier transform of
the autocorrelation is equal to the power spectrum. In $z$-transform language, this Fourier
transform is just a Laurent series in $z$. The equation that is to be satisfied by the coefficients
in equation (13.7.4) is thus

$$ \frac{a_0}{\left| 1 + \sum_{k=1}^{M} a_k z^k \right|^2} \approx \sum_{j=-M}^{M} \phi_j z^j \tag{13.7.7} $$

The approximately equal sign in (13.7.7) has a somewhat special interpretation. It means
that the series expansion of the left-hand side is supposed to agree with the right-hand side
term by term from $z^{-M}$ to $z^{M}$. Outside this range of terms, the right-hand side is obviously
zero, while the left-hand side will still have nonzero terms. Notice that $M$, the number of
coefficients in the approximation on the left-hand side, can be any integer up to $N$, the total
number of autocorrelations available. (In practice, one often chooses $M$ much smaller than
$N$.) $M$ is called the order or number of poles of the approximation.

Whatever the chosen value of M, the series expansion of the left-hand side of (13.7.7) 
defines a certain sort of extrapolation of the autocorrelation function to lags larger than M, in 
fact even to lags larger than N, i.e., larger than the run of data can actually measure. It turns 
out that this particular extrapolation can be shown to have, among all possible extrapolations, 
the maximum entropy in a definable information-theoretic sense. Hence the name maximum 
entropy method, or MEM. The maximum entropy property has caused MEM to acquire a 
certain “cult” popularity; one sometimes hears that it gives an intrinsically “better” estimate 
than is given by other methods. Don’t believe it. MEM has the very cute property of 
being able to fit sharp spectral features, but there is nothing else magical about its power 
spectrum estimates. 

The operations count in memcof scales as the product of N (the number of data points) 
and M (the desired order of the MEM approximation). If M were chosen to be as large as 
N, then the method would be much slower than the N log N FFT methods of the previous 
section. In practice, however, one usually wants to limit the order (or number of poles) of the 
MEM approximation to a few times the number of sharp spectral features that one desires it 
to fit. With this restricted number of poles, the method will smooth the spectrum somewhat, 
but this is often a desirable property. While exact values depend on the application, one 
might take M = 10 or 20 or 50 for N = 1000 or 10000. In that case MEM estimation is 
not much slower than FFT estimation. 

We feel obliged to warn you that memcof can be a bit quirky at times. If the number of 
poles or number of data points is too large, roundoff error can be a problem, even in double 
precision. With “peaky” data (i.e., data with extremely sharp spectral features), the algorithm 
may suggest split peaks even at modest orders, and the peaks may shift with the phase of the 
sine wave. Also, with noisy input functions, if you choose too high an order, you will find 
spurious peaks galore! Some experts recommend the use of this algorithm in conjunction with 
more conservative methods, like periodograms, to help choose the correct model order, and to 
avoid getting too fooled by spurious spectral features. MEM can be finicky, but it can also do 
remarkable things. We recommend that you try it out, cautiously, on your own problems. We 
now turn to the evaluation of the MEM spectral estimate from its coefficients. 

The MEM estimation (13.7.4) is a function of continuously varying frequency $f$. There
is no special significance to specific equally spaced frequencies as there was in the FFT case.
In fact, since the MEM estimate may have very sharp spectral features, one wants to be able to
evaluate it on a very fine mesh near to those features, but perhaps only more coarsely farther
away from them. Here is a function which, given the coefficients already computed, evaluates
(13.7.4) and returns the estimated power spectrum as a function of $f\Delta$ (the frequency times
the sampling interval). Of course, $f\Delta$ should lie in the Nyquist range between $-1/2$ and $1/2$.


#include <math.h>

float evlmem(float fdt, float d[], int m, float xms)
/* Given d[1..m], m, xms as returned by memcof, this function returns the power spectrum
   estimate P(f) as a function of fdt = f*Delta. */
{
    int i;
    float sumr=1.0,sumi=0.0;
    double wr=1.0,wi=0.0,wpr,wpi,wtemp,theta;   /* Trig. recurrences in double precision. */

    theta=6.28318530717959*fdt;
    wpr=cos(theta);                             /* Set up for recurrence relations. */
    wpi=sin(theta);
    for (i=1;i<=m;i++) {                        /* Loop over the terms in the sum. */
        wr=(wtemp=wr)*wpr-wi*wpi;
        wi=wi*wpr+wtemp*wpi;
        sumr -= d[i]*wr;                        /* These accumulate the denominator of (13.7.4). */
        sumi -= d[i]*wi;
    }
    return xms/(sumr*sumr+sumi*sumi);           /* Equation (13.7.4). */
}


Be sure to evaluate P(f) on a fine enough grid to find any narrow features that may 
be there! Such narrow features, if present, can contain virtually all of the power in the data. 
You might also wish to know how the P(f) produced by the routines memcof and evlmem is 
normalized with respect to the mean square value of the input data vector. The answer is 

$$\int_{-1/2}^{1/2} P(f\Delta)\,d(f\Delta) \approx 2\int_{0}^{1/2} P(f\Delta)\,d(f\Delta) = \text{mean square value of data} \tag{13.7.8}$$

Sample spectra produced by the routines memcof and evlmem are shown in Figure 13.7.1. 


CITED REFERENCES AND FURTHER READING: 

Childers, D.G. (ed.) 1978, Modern Spectrum Analysis (New York: IEEE Press), Chapter II. 
Kay, S.M., and Marple, S.L. 1981, Proceedings of the IEEE, vol. 69, pp. 1380-1419. 


13.8 Spectral Analysis of Unevenly Sampled Data


Thus far, we have been dealing exclusively with evenly sampled data,

$$h_n = h(n\Delta) \qquad n = \ldots,-3,-2,-1,0,1,2,3,\ldots \tag{13.8.1}$$

where $\Delta$ is the sampling interval, whose reciprocal is the sampling rate. Recall also (§12.1)
the significance of the Nyquist critical frequency

$$f_c \equiv \frac{1}{2\Delta} \tag{13.8.2}$$

as codified by the sampling theorem: A sampled data set like equation (13.8.1) contains
complete information about all spectral components in a signal $h(t)$ up to the Nyquist
frequency, and scrambled or aliased information about any signal components at frequencies
larger than the Nyquist frequency. The sampling theorem thus defines both the attractiveness,
and the limitation, of any analysis of an evenly spaced data set.

Figure 13.7.1. Sample output of maximum entropy spectral estimation. The input signal consists of
512 samples of the sum of two sinusoids of very nearly the same frequency, plus white noise with about
equal power. Shown is an expanded portion of the full Nyquist frequency interval (which would extend
from zero to 0.5). The dashed spectral estimate uses 20 poles; the dotted, 40; the solid, 150. With the
larger number of poles, the method can resolve the distinct sinusoids; but the flat noise background is
beginning to show spurious peaks. (Note logarithmic scale.)

There are situations, however, where evenly spaced data cannot be obtained. A common 
case is where instrumental drop-outs occur, so that data is obtained only on a (not consecutive 
integer) subset of equation (13.8.1), the so-called missing data problem. Another case, 
common in observational sciences like astronomy, is that the observer cannot completely 
control the time of the observations, but must simply accept a certain dictated set of $t_i$’s.

There are some obvious ways to get from unevenly spaced $t_i$’s to evenly spaced ones, as
in equation (13.8.1). Interpolation is one way: lay down a grid of evenly spaced times on your 
data and interpolate values onto that grid; then use FFT methods. In the missing data problem, 
you only have to interpolate on missing data points. If a lot of consecutive points are missing, 
you might as well just set them to zero, or perhaps “clamp” the value at the last measured point. 
However, the experience of practitioners of such interpolation techniques is not reassuring. 
Generally speaking, such techniques perform poorly. Long gaps in the data, for example, 
often produce a spurious bulge of power at low frequencies (wavelengths comparable to gaps). 

A completely different method of spectral analysis for unevenly sampled data, one that
mitigates these difficulties and has some other very desirable properties, was developed by
Lomb [1], based in part on earlier work by Barning [2] and Vaníček [3], and additionally
elaborated by Scargle [4]. The Lomb method (as we will call it) evaluates data, and sines
and cosines, only at times $t_i$ that are actually measured. Suppose that there are N data
points $h_i \equiv h(t_i)$, $i = 1, \ldots, N$. Then first find the mean and variance of the data by
the usual formulas,

$$\bar h \equiv \frac{1}{N}\sum_{i=1}^{N} h_i \qquad \sigma^2 \equiv \frac{1}{N-1}\sum_{i=1}^{N} (h_i - \bar h)^2 \tag{13.8.3}$$


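Now, the Lomb normalized periodogram (spectral power as a function of angular
frequency $\omega \equiv 2\pi f > 0$) is defined by

$$P_N(\omega) \equiv \frac{1}{2\sigma^2}\left\{ \frac{\left[\sum_j (h_j - \bar h)\cos\omega(t_j - \tau)\right]^2}{\sum_j \cos^2\omega(t_j - \tau)} + \frac{\left[\sum_j (h_j - \bar h)\sin\omega(t_j - \tau)\right]^2}{\sum_j \sin^2\omega(t_j - \tau)} \right\} \tag{13.8.4}$$

Here $\tau$ is defined by the relation

$$\tan(2\omega\tau) = \frac{\sum_j \sin 2\omega t_j}{\sum_j \cos 2\omega t_j} \tag{13.8.5}$$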


The constant $\tau$ is a kind of offset that makes $P_N(\omega)$ completely independent of shifting
all the $t_i$’s by any constant. Lomb shows that this particular choice of offset has another,
deeper, effect: It makes equation (13.8.4) identical to the equation that one would obtain if one
estimated the harmonic content of a data set, at a given frequency $\omega$, by linear least-squares
fitting to the model

$$h(t) = A\cos\omega t + B\sin\omega t \tag{13.8.6}$$


This fact gives some insight into why the method can give results superior to FFT methods: It 
weights the data on a “per point” basis instead of on a “per time interval” basis, when uneven 
sampling can render the latter seriously in error. 

A very common occurrence is that the measured data points $h_i$ are the sum of a periodic
signal and independent (white) Gaussian noise. If we are trying to determine the presence
or absence of such a periodic signal, we want to be able to give a quantitative answer to
the question, “How significant is a peak in the spectrum $P_N(\omega)$?” In this question, the null
hypothesis is that the data values are independent Gaussian random values. A very nice
property of the Lomb normalized periodogram is that the viability of the null hypothesis can
be tested fairly rigorously, as we now discuss.

The word “normalized” refers to the factor $\sigma^2$ in the denominator of equation (13.8.4).
Scargle [4] shows that with this normalization, at any particular $\omega$ and in the case of the null
hypothesis, $P_N(\omega)$ has an exponential probability distribution with unit mean. In other words,
the probability that $P_N(\omega)$ will be between some positive $z$ and $z + dz$ is $\exp(-z)\,dz$. It
readily follows that, if we scan some M independent frequencies, the probability that none
give values larger than $z$ is $(1 - e^{-z})^M$. So

$$P(>z) \equiv 1 - (1 - e^{-z})^M \tag{13.8.7}$$

is the false-alarm probability of the null hypothesis, that is, the significance level of any peak
in $P_N(\omega)$ that we do see. A small value for the false-alarm probability indicates a highly
significant periodic signal.

To evaluate this significance, we need to know M. After all, the more frequencies we
look at, the less significant is some one modest bump in the spectrum. (Look long enough,
find anything!) A typical procedure will be to plot $P_N(\omega)$ as a function of many closely
spaced frequencies in some large frequency range. How many of these are independent?

Before answering, let us first see how accurately we need to know M. The interesting
region is where the significance is a small (significant) number, $\ll 1$. There, equation (13.8.7)
can be series expanded to give

$$P(>z) \approx M e^{-z} \tag{13.8.8}$$

Figure 13.8.1. Example of the Lomb algorithm in action. The 100 data points (upper figure) are at
random times between 0 and 100. Their sinusoidal component is readily uncovered (lower figure) by 
the algorithm, at a significance level better than 0.001. If the 100 data points had been evenly spaced at 
unit interval, the Nyquist critical frequency would have been 0.5. Note that, for these unevenly spaced 
points, there is no visible aliasing into the Nyquist range. 



We see that the significance scales linearly with M. Practical significance levels are numbers 
like 0.05, 0.01, 0.001, etc. An error of even ±50% in the estimated significance is often 
tolerable, since quoted significance levels are typically spaced apart by factors of 5 or 10. So 
our estimate of M need not be very accurate. 

Horne and Baliunas [5] give results from extensive Monte Carlo experiments for determining
M in various cases. In general M depends on the number of frequencies sampled,
the number of data points N, and their detailed spacing. It turns out that M is very nearly
equal to N when the data points are approximately equally spaced, and when the sampled
frequencies “fill” (oversample) the frequency range from 0 to the Nyquist frequency $f_c$
(equation 13.8.2). Further, the value of M is not importantly different for random spacing of 
the data points than for equal spacing. When a larger frequency range than the Nyquist range 
is sampled, M increases proportionally. About the only case where M differs significantly 
from the case of evenly spaced points is when the points are closely clumped, say into 
groups of 3; then (as one would expect) the number of independent frequencies is reduced 
by a factor of about 3. 

The program period, below, calculates an effective value for M based on the above 
rough-and-ready rules and assumes that there is no important clumping. This will be adequate 
for most purposes. In any particular case, if it really matters, it is not too difficult to compute 
a better value of M by simple Monte Carlo: Holding fixed the number of data points and their 
locations $t_i$, generate synthetic data sets of Gaussian (normal) deviates, find the largest values
of $P_N(\omega)$ for each such data set (using the accompanying program), and fit the resulting
distribution for M in equation (13.8.7). 
Figure 13.8.1 shows the results of applying the method as discussed so far. In the
upper figure, the data points are plotted against time. Their number is N = 100, and their
distribution in t is Poisson random. There is certainly no sinusoidal signal evident to the eye.
The lower figure plots $P_N(\omega)$ against frequency $f = \omega/2\pi$. The Nyquist critical frequency
that would obtain if the points were evenly spaced is at $f = f_c = 0.5$. Since we have searched
up to about twice that frequency, and oversampled the $f$’s to the point where successive values
of $P_N(\omega)$ vary smoothly, we take M = 2N. The horizontal dashed and dotted lines are
(respectively from bottom to top) significance levels 0.5, 0.1, 0.05, 0.01, 0.005, and 0.001. 
One sees a highly significant peak at a frequency of 0.81. That is in fact the frequency of the 
sine wave that is present in the data. (You will have to take our word for this!) 

Note that two other peaks approach, but do not exceed the 50% significance level; that 
is about what one might expect by chance. It is also worth commenting on the fact that the 
significant peak was found (correctly) above the Nyquist frequency and without any significant 
aliasing down into the Nyquist interval! That would not be possible for evenly spaced data. It 
is possible here because the randomly spaced data has some points spaced much closer than 
the “average” sampling rate, and these remove ambiguity from any aliasing.

Implementation of the normalized periodogram in code is straightforward, with, however, 
a few points to be kept in mind. We are dealing with a slow algorithm. Typically, for N data 
points, we may wish to examine on the order of 2N or 4N frequencies. Each combination
of frequency and data point has, in equations (13.8.4) and (13.8.5), not just a few adds or
multiplies, but four calls to trigonometric functions; the operations count can easily reach
several hundred times $N^2$. It is highly desirable (in fact it results in a factor 4 speedup)
to replace these trigonometric calls by recurrences. That is possible only if the sequence of
frequencies examined is a linear sequence. Since such a sequence is probably what most users 
would want anyway, we have built this into the implementation. 

At the end of this section we describe a way to evaluate equations (13.8.4) and (13.8.5) 
— approximately, but to any desired degree of approximation — by a fast method [6] whose 
operation count goes only as N log N. This faster method should be used for long data sets. 

The lowest independent frequency $f$ to be examined is the inverse of the span of the
input data, $\max_i(t_i) - \min_i(t_i) \equiv T$. This is the frequency such that the data can include one
complete cycle. In subtracting off the data’s mean, equation (13.8.4) already assumed that you
are not interested in the data’s zero-frequency piece, which is just that mean value. In an
FFT method, higher independent frequencies would be integer multiples of $1/T$. Because we
are interested in the statistical significance of any peak that may occur, however, we had better
(over-) sample more finely than at interval $1/T$, so that sample points lie close to the top of
any peak. Thus, the accompanying program includes an oversampling parameter, called ofac;
a value $\texttt{ofac} \gtrsim 4$ might be typical in use. We also want to specify how high in frequency
to go, say $f_{hi}$. One guide to choosing $f_{hi}$ is to compare it with the Nyquist frequency $f_c$
which would obtain if the N data points were evenly spaced over the same span T, that is
$f_c = N/(2T)$. The accompanying program includes an input parameter hifac, defined as
$f_{hi}/f_c$. The number of different frequencies $N_P$ returned by the program is then given by

$$N_P = \frac{\texttt{ofac} \times \texttt{hifac}}{2}\, N \tag{13.8.9}$$


(You have to remember to dimension the output arrays to at least this size.) 
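
To size the output arrays before calling the routine, equation (13.8.9) and the frequency grid can be written out as two one-line helpers. The names `lomb_nfreq` and `lomb_freq` are hypothetical; the truncation to an integer is our reading of how an integer count would be formed from the floating-point product.

```c
/* Number of frequencies produced for n data points, as in equation (13.8.9).
   The product is truncated toward zero when converted to an integer. */
int lomb_nfreq(double ofac, double hifac, int n)
{
    return (int)(0.5 * ofac * hifac * n);
}

/* The k-th frequency of the grid (k = 1..nfreq) for data spanning a total time tspan:
   multiples of 1/(ofac*tspan), so that k = nfreq lands near hifac times N/(2T). */
double lomb_freq(int k, double ofac, double tspan)
{
    return k / (ofac * tspan);
}
```

For N = 100 points spanning T = 100 with ofac = 4 and hifac = 2, this gives 400 frequencies, the highest being 1.0, or twice the "average" Nyquist frequency of 0.5.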

The code does the trigonometric recurrences in double precision and embodies a few 
tricks with trigonometric identities, to decrease roundoff errors. If you are an aficionado of 
such things you can puzzle it out. A final detail is that equation (13.8.7) will fail because of 
roundoff error if $z$ is too large; but equation (13.8.8) is fine in this regime.


#include <math.h>
#include "nrutil.h"
#define TWOPID 6.2831853071795865

void period(float x[], float y[], int n, float ofac, float hifac, float px[],
    float py[], int np, int *nout, int *jmax, float *prob)
/* Given n data points with abscissas x[1..n] (which need not be equally spaced) and ordinates
   y[1..n], and given a desired oversampling factor ofac (a typical value being 4 or larger),
   this routine fills array px[1..np] with an increasing sequence of frequencies (not angular
   frequencies) up to hifac times the "average" Nyquist frequency, and fills array py[1..np]
   with the values of the Lomb normalized periodogram at those frequencies. The arrays x and y
   are not altered. np, the dimension of px and py, must be large enough to contain the output,
   or an error results. The routine also returns jmax such that py[jmax] is the maximum element
   in py, and prob, an estimate of the significance of that maximum against the hypothesis of
   random noise. A small value of prob indicates that a significant periodic signal is present. */
{
    void avevar(float data[], unsigned long n, float *ave, float *var);
    int i,j;
    float ave,c,cc,cwtau,effm,expy,pnow,pymax,s,ss,sumc,sumcy,sums,sumsh,
        sumsy,swtau,var,wtau,xave,xdif,xmax,xmin,yy;
    double arg,wtemp,*wi,*wpi,*wpr,*wr;

    wi=dvector(1,n);
    wpi=dvector(1,n);
    wpr=dvector(1,n);
    wr=dvector(1,n);
    *nout=0.5*ofac*hifac*n;
    if (*nout > np) nrerror("output arrays too short in period");
    avevar(y,n,&ave,&var);              /* Get mean and variance of the input data. */
    if (var == 0.0) nrerror("zero variance in period");
    xmax=xmin=x[1];                     /* Go through data to get the range of abscissas. */
    for (j=1;j<=n;j++) {
        if (x[j] > xmax) xmax=x[j];
        if (x[j] < xmin) xmin=x[j];
    }
    xdif=xmax-xmin;
    xave=0.5*(xmax+xmin);
    pymax=0.0;
    pnow=1.0/(xdif*ofac);               /* Starting frequency. */
    for (j=1;j<=n;j++) {                /* Initialize values for the trigonometric recurrences */
        arg=TWOPID*((x[j]-xave)*pnow);  /* at each data point. The recurrences are done in */
        wpr[j] = -2.0*SQR(sin(0.5*arg));/* double precision. */
        wpi[j]=sin(arg);
        wr[j]=cos(arg);
        wi[j]=wpi[j];
    }
    for (i=1;i<=(*nout);i++) {          /* Main loop over the frequencies to be evaluated. */
        px[i]=pnow;
        sumsh=sumc=0.0;
        for (j=1;j<=n;j++) {            /* First, loop over the data to get tau and related */
            c=wr[j];                    /* quantities. */
            s=wi[j];
            sumsh += s*c;
            sumc += (c-s)*(c+s);
        }
        wtau=0.5*atan2(2.0*sumsh,sumc);
        swtau=sin(wtau);
        cwtau=cos(wtau);
        sums=sumc=sumsy=sumcy=0.0;      /* Then, loop over the data again to get the */
        for (j=1;j<=n;j++) {            /* periodogram value. */
            s=wi[j];
            c=wr[j];
            ss=s*cwtau-c*swtau;
            cc=c*cwtau+s*swtau;
            sums += ss*ss;
            sumc += cc*cc;
            yy=y[j]-ave;
            sumsy += yy*ss;
            sumcy += yy*cc;
            wr[j]=((wtemp=wr[j])*wpr[j]-wi[j]*wpi[j])+wr[j];  /* Update the trigonometric */
            wi[j]=(wi[j]*wpr[j]+wtemp*wpi[j])+wi[j];          /* recurrences. */
        }
        py[i]=0.5*(sumcy*sumcy/sumc+sumsy*sumsy/sums)/var;
        if (py[i] >= pymax) pymax=py[(*jmax=i)];
        pnow += 1.0/(ofac*xdif);        /* The next frequency. */
    }
    expy=exp(-pymax);                   /* Evaluate statistical significance of the maximum. */
    effm=2.0*(*nout)/ofac;
    *prob=effm*expy;
    if (*prob > 0.01) *prob=1.0-pow(1.0-expy,effm);
    free_dvector(wr,1,n);
    free_dvector(wpr,1,n);
    free_dvector(wpi,1,n);
    free_dvector(wi,1,n);
}


Fast Computation of the Lomb Periodogram 


We here show how equations (13.8.4) and (13.8.5) can be calculated — approximately, 
but to any desired precision — with an operation count only of order Np log Np. The 
method uses the FFT, but it is in no sense an FFT periodogram of the data. It is an actual 
evaluation of equations (13.8.4) and (13.8.5), the Lomb normalized periodogram, with exactly 
that method’s strengths and weaknesses. This fast algorithm, due to Press and Rybicki [6],
makes feasible the application of the Lomb method to data sets at least as large as $10^6$ points;
it is already faster than straightforward evaluation of equations (13.8.4) and (13.8.5) for data 
sets as small as 60 or 100 points. 

Notice that the trigonometric sums that occur in equations (13.8.5) and (13.8.4) can be 
reduced to four simpler sums. If we define 


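$$S_h \equiv \sum_{j=1}^{N} (h_j - \bar h)\sin(\omega t_j) \qquad C_h \equiv \sum_{j=1}^{N} (h_j - \bar h)\cos(\omega t_j) \tag{13.8.10}$$

$$S_2 \equiv \sum_{j=1}^{N} \sin(2\omega t_j) \qquad C_2 \equiv \sum_{j=1}^{N} \cos(2\omega t_j) \tag{13.8.11}$$

then

$$\begin{aligned}
\sum_{j=1}^{N} (h_j - \bar h)\cos\omega(t_j - \tau) &= C_h\cos\omega\tau + S_h\sin\omega\tau \\
\sum_{j=1}^{N} (h_j - \bar h)\sin\omega(t_j - \tau) &= S_h\cos\omega\tau - C_h\sin\omega\tau \\
\sum_{j=1}^{N} \cos^2\omega(t_j - \tau) &= \frac{N}{2} + \frac{1}{2}C_2\cos(2\omega\tau) + \frac{1}{2}S_2\sin(2\omega\tau) \\
\sum_{j=1}^{N} \sin^2\omega(t_j - \tau) &= \frac{N}{2} - \frac{1}{2}C_2\cos(2\omega\tau) - \frac{1}{2}S_2\sin(2\omega\tau)
\end{aligned} \tag{13.8.12}$$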


Now notice that if the $t_j$’s were evenly spaced, then the four quantities $S_h$, $C_h$, $S_2$, and $C_2$ could
be evaluated by two complex FFTs, and the results could then be substituted back through 
equation (13.8.12) to evaluate equations (13.8.5) and (13.8.4). The problem is therefore only 
to evaluate equations (13.8.10) and (13.8.11) for unevenly spaced data. 

Interpolation, or rather reverse interpolation — we will here call it extirpolation — 
provides the key. Interpolation, as classically understood, uses several function values on a 
regular mesh to construct an accurate approximation at an arbitrary point. Extirpolation, just 
the opposite, replaces a function value at an arbitrary point by several function values on a 
regular mesh, doing this in such a way that sums over the mesh are an accurate approximation 
to sums over the original arbitrary point. 

It is not hard to see that the weight functions for extirpolation are identical to those for
interpolation. Suppose that the function $h(t)$ to be extirpolated is known only at the discrete
(unevenly spaced) points $h(t_i) \equiv h_i$, and that the function $g(t)$ (which will be, e.g., $\cos\omega t$)
can be evaluated anywhere. Let $\hat t_k$ be a sequence of evenly spaced points on a regular mesh.
Then Lagrange interpolation (§3.1) gives an approximation of the form

$$g(t) \approx \sum_k w_k(t)\, g(\hat t_k) \tag{13.8.13}$$

where $w_k(t)$ are interpolation weights. Now let us evaluate a sum of interest by the following
scheme:

$$\sum_{j=1}^{N} h_j g(t_j) \approx \sum_{j=1}^{N} h_j \left[\sum_k w_k(t_j)\, g(\hat t_k)\right] = \sum_k \left[\sum_{j=1}^{N} h_j w_k(t_j)\right] g(\hat t_k) \equiv \sum_k \hat h_k\, g(\hat t_k) \tag{13.8.14}$$

Here $\hat h_k \equiv \sum_j h_j w_k(t_j)$. Notice that equation (13.8.14) replaces the original sum by one
on the regular mesh. Notice also that the accuracy of equation (13.8.13) depends only on the
fineness of the mesh with respect to the function $g$ and has nothing to do with the spacing of the
points $t_j$ or the function $h$; therefore the accuracy of equation (13.8.14) also has this property.

The general outline of the fast evaluation method is therefore this: (i) Choose a mesh 
size large enough to accommodate some desired oversampling factor, and large enough to 
have several extirpolation points per half-wavelength of the highest frequency of interest, (ii) 
Extirpolate the values $h_i$ onto the mesh and take the FFT; this gives $S_h$ and $C_h$ in equation
(13.8.10). (iii) Extirpolate the constant values 1 onto another mesh, and take its FFT; this,
with some manipulation, gives $S_2$ and $C_2$ in equation (13.8.11). (iv) Evaluate equations
(13.8.12), (13.8.5), and (13.8.4), in that order. 
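
Step (i) determines how much workspace the caller must provide. The helper below, under the hypothetical name `fasper_workspace`, isolates the sizing rule used in the listing that follows: the mesh must hold about ofac × hifac × n × macc points, rounded up to a power of 2 (and never fewer than 64), and the real workspace is twice that length. Here `macc` stands for the number of extirpolation points per quarter cycle of the highest frequency.

```c
/* Workspace length needed for n data points: the smallest power of 2 that is at
   least ofac*hifac*n*macc (and at least 64), doubled to give the array length. */
unsigned long fasper_workspace(unsigned long n, double ofac, double hifac, int macc)
{
    unsigned long nfreqt = (unsigned long)(ofac * hifac * n * macc);
    unsigned long nfreq = 64;

    while (nfreq < nfreqt) nfreq <<= 1;
    return nfreq << 1;
}
```

For n = 100, ofac = 4, hifac = 2, and macc = 4, the product is 3200, the power of 2 is 4096, and each workspace array needs 8192 elements.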

There are several other tricks involved in implementing this algorithm efficiently. You 
can figure most out from the code, but we will mention the following points: (a) A nice way 
to get transform values at frequencies $2\omega$ instead of $\omega$ is to stretch the time-domain data by a
factor 2, and then wrap it to double-cover the original length. (This trick goes back to Tukey.)
In the program, this appears as a modulo function. (b) Trigonometric identities are used to
get from the left-hand side of equation (13.8.5) to the various needed trigonometric functions
of $\omega\tau$. C identifiers like (e.g.) cwt and hs2wt represent quantities like (e.g.) $\cos\omega\tau$ and
$\frac{1}{2}\sin(2\omega\tau)$. (c) The function spread does extirpolation onto the M most nearly centered
mesh points around an arbitrary point; its turgid code evaluates coefficients of the Lagrange 
interpolating polynomials, in an efficient manner. 


#include <math.h>
#include "nrutil.h"
#define MOD(a,b) while(a >= b) a -= b;   /* Positive numbers only. */
#define MACC 4                           /* Number of interpolation points per 1/4
                                            cycle of highest frequency. */

void fasper(float x[], float y[], unsigned long n, float ofac, float hifac,
    float wk1[], float wk2[], unsigned long nwk, unsigned long *nout,
    unsigned long *jmax, float *prob)
/* Given n data points with abscissas x[1..n] (which need not be equally spaced) and ordinates
   y[1..n], and given a desired oversampling factor ofac (a typical value being 4 or larger), this
   routine fills array wk1[1..nwk] with a sequence of nout increasing frequencies (not angular
   frequencies) up to hifac times the "average" Nyquist frequency, and fills array wk2[1..nwk]
   with the values of the Lomb normalized periodogram at those frequencies. The arrays x and
   y are not altered. nwk, the dimension of wk1 and wk2, must be large enough for intermediate
   work space, or an error results. The routine also returns jmax such that wk2[jmax] is the
   maximum element in wk2, and prob, an estimate of the significance of that maximum against
   the hypothesis of random noise. A small value of prob indicates that a significant periodic
   signal is present. */
{
    void avevar(float data[], unsigned long n, float *ave, float *var);
    void realft(float data[], unsigned long n, int isign);
    void spread(float y, float yy[], unsigned long n, float x, int m);
    unsigned long j,k,ndim,nfreq,nfreqt;
    float ave,ck,ckk,cterm,cwt,den,df,effm,expy,fac,fndim,hc2wt;
    float hs2wt,hypo,pmax,sterm,swt,var,xdif,xmax,xmin;

    *nout=0.5*ofac*hifac*n;
    nfreqt=ofac*hifac*n*MACC;            /* Size the FFT as next power of 2 above nfreqt. */
    nfreq=64;
    while (nfreq < nfreqt) nfreq <<= 1;
    ndim=nfreq << 1;
    if (ndim > nwk) nrerror("workspaces too small in fasper");
    avevar(y,n,&ave,&var);               /* Compute the mean, variance, and range of the data. */
    if (var == 0.0) nrerror("zero variance in fasper");
    xmin=x[1];
    xmax=xmin;
    for (j=2;j<=n;j++) {
        if (x[j] < xmin) xmin=x[j];
        if (x[j] > xmax) xmax=x[j];
    }
    xdif=xmax-xmin;
    for (j=1;j<=ndim;j++) wk1[j]=wk2[j]=0.0;   /* Zero the workspaces. */
    fac=ndim/(xdif*ofac);
    fndim=ndim;
    for (j=1;j<=n;j++) {                 /* Extirpolate the data into the workspaces. */
        ck=(x[j]-xmin)*fac;
        MOD(ck,fndim)
        ckk=2.0*(ck++);
        MOD(ckk,fndim)
        ++ckk;
        spread(y[j]-ave,wk1,ndim,ck,MACC);
        spread(1.0,wk2,ndim,ckk,MACC);
    }
    realft(wk1,ndim,1);                  /* Take the Fast Fourier Transforms. */
    realft(wk2,ndim,1);
    df=1.0/(xdif*ofac);
    pmax = -1.0;
    for (k=3,j=1;j<=(*nout);j++,k+=2) {  /* Compute the Lomb value for each frequency. */
        hypo=sqrt(wk2[k]*wk2[k]+wk2[k+1]*wk2[k+1]);
        hc2wt=0.5*wk2[k]/hypo;
        hs2wt=0.5*wk2[k+1]/hypo;
        cwt=sqrt(0.5+hc2wt);
        swt=SIGN(sqrt(0.5-hc2wt),hs2wt);
        den=0.5*n+hc2wt*wk2[k]+hs2wt*wk2[k+1];
        cterm=SQR(cwt*wk1[k]+swt*wk1[k+1])/den;
        sterm=SQR(cwt*wk1[k+1]-swt*wk1[k])/(n-den);
        wk1[j]=j*df;
        wk2[j]=(cterm+sterm)/(2.0*var);
        if (wk2[j] > pmax) pmax=wk2[(*jmax=j)];
    }
    expy=exp(-pmax);                     /* Estimate significance of largest peak value. */
    effm=2.0*(*nout)/ofac;
    *prob=effm*expy;
    if (*prob > 0.01) *prob=1.0-pow(1.0-expy,effm);
}

#include "nrutil.h"

void spread(float y, float yy[], unsigned long n, float x, int m)
/* Given an array yy[1..n], extirpolate (spread) a value y into m actual array elements that best
   approximate the "fictional" (i.e., possibly noninteger) array element number x. The weights
   used are coefficients of the Lagrange interpolating polynomial. */
{
    int ihi,ilo,ix,j,nden;
    static long nfac[11]={0,1,1,2,6,24,120,720,5040,40320,362880};
    float fac;

    if (m > 10) nrerror("factorial table too small in spread");
    ix=(int)x;
    if (x == (float)ix) yy[ix] += y;
    else {
        ilo=LMIN(LMAX((long)(x-0.5*m+1.0),1),n-m+1);
        ihi=ilo+m-1;
        nden=nfac[m];
        fac=x-ilo;
        for (j=ilo+1;j<=ihi;j++) fac *= (x-j);
        yy[ihi] += y*fac/(nden*(x-ihi));
        for (j=ihi-1;j>=ilo;j--) {
            nden=(nden/(j+1-ilo))*(j-ihi);
            yy[j] += y*fac/(nden*(x-j));
        }
    }
}


13.9 Computing Fourier Integrals Using the FFT


Not uncommonly, one wants to calculate accurate numerical values for integrals of
the form

$$I = \int_a^b e^{i\omega t} h(t)\,dt , \tag{13.9.1}$$

or the equivalent real and imaginary parts

$$I_c = \int_a^b \cos(\omega t)\,h(t)\,dt \qquad I_s = \int_a^b \sin(\omega t)\,h(t)\,dt , \tag{13.9.2}$$

and one wants to evaluate this integral for many different values of $\omega$. In cases of interest, $h(t)$
is often a smooth function, but it is not necessarily periodic in $[a, b]$, nor does it necessarily
go to zero at $a$ or $b$. While it seems intuitively obvious that the force majeure of the FFT
ought to be applicable to this problem, doing so turns out to be a surprisingly subtle matter,
as we will now see.

Let us first approach the problem naively, to see where the difficulty lies. Divide the 
interval $[a, b]$ into M subintervals, where M is a large integer, and define

$$\Delta \equiv \frac{b - a}{M}, \qquad t_j \equiv a + j\Delta, \qquad h_j \equiv h(t_j), \qquad j = 0, \ldots, M \tag{13.9.3}$$

Notice that $h_0 = h(a)$ and $h_M = h(b)$, and that there are M + 1 values $h_j$. We can
approximate the integral $I$ by a sum,

$$I \approx \Delta \sum_{j=0}^{M-1} h_j \exp(i\omega t_j) \tag{13.9.4}$$

which is at any rate first-order accurate. (If we centered the $h_j$’s and the $t_j$’s in the intervals,
we could be accurate to second order.) Now for certain values of $\omega$ and M, the sum in
equation (13.9.4) can be made into a discrete Fourier transform, or DFT, and evaluated by
the fast Fourier transform (FFT) algorithm. In particular, we can choose M to be an integer
power of 2, and define a set of special $\omega$’s by

$$\omega_m \Delta \equiv \frac{2\pi m}{M} \tag{13.9.5}$$

where m has the values $m = 0, 1, \ldots, M/2 - 1$. Then equation (13.9.4) becomes

$$I(\omega_m) \approx \Delta e^{i\omega_m a} \sum_{j=0}^{M-1} h_j e^{2\pi i m j/M} = \Delta e^{i\omega_m a}\,[\mathrm{DFT}(h_0 \ldots h_{M-1})]_m \tag{13.9.6}$$

Equation (13.9.6), while simple and clear, is emphatically not recommended for use: It is 
likely to give wrong answers! 

The problem lies in the oscillatory nature of the integral (13.9.1). If $h(t)$ is at all smooth,
and if $\omega$ is large enough to imply several cycles in the interval $[a, b]$ (in fact, $\omega_m$ in equation
(13.9.5) gives exactly m cycles), then the value of $I$ is typically very small, so small that
it is easily swamped by first-order, or even (with centered values) second-order, truncation
error. Furthermore, the characteristic “small parameter” that occurs in the error term is not
$\Delta/(b - a) = 1/M$, as it would be if the integrand were not oscillatory, but $\omega\Delta$, which can be
as large as $\pi$ for $\omega$’s within the Nyquist interval of the DFT (cf. equation 13.9.5). The result
is that equation (13.9.6) becomes systematically inaccurate as $\omega$ increases.

It is a sobering exercise to implement equation (13.9.6) for an integral that can be done 
analytically, and to see just how bad it is. We recommend that you try it. 

Let us therefore turn to a more sophisticated treatment. Given the sampled points $h_j$, we
can approximate the function $h(t)$ everywhere in the interval $[a, b]$ by interpolation on nearby
$h_j$’s. The simplest case is linear interpolation, using the two nearest $h_j$’s, one to the left and
one to the right. A higher-order interpolation, e.g., would be cubic interpolation, using two
points to the left and two to the right, except in the first and last subintervals, where we
must interpolate with three $h_j$’s on one side, one on the other.

The formulas for such interpolation schemes are (piecewise) polynomial in the independent
variable $t$, but with coefficients that are of course linear in the function values
$h_j$. Although one does not usually think of it in this way, interpolation can be viewed as
approximating a function by a sum of kernel functions (which depend only on the interpolation
scheme) times sample values (which depend only on the function). Let us write

$$h(t) \approx \sum_{j=0}^{M} h_j\, \psi\!\left(\frac{t - t_j}{\Delta}\right) + \sum_{j=\text{endpoints}} h_j\, \varphi_j\!\left(\frac{t - t_j}{\Delta}\right) \tag{13.9.7}$$

Here $\psi(s)$ is the kernel function of an interior point: It is zero for $s$ sufficiently negative
or sufficiently positive, and becomes nonzero only when $s$ is in the range where the
$h_j$ multiplying it is actually used in the interpolation. We always have $\psi(0) = 1$ and
$\psi(m) = 0$, $m = \pm 1, \pm 2, \ldots$, since interpolation right on a sample point should give the
sampled function value. For linear interpolation $\psi(s)$ is piecewise linear, rises from 0 to 1
for $s$ in $(-1, 0)$, and falls back to 0 for $s$ in $(0, 1)$. For higher-order interpolation, $\psi(s)$ is
made up piecewise of segments of Lagrange interpolation polynomials. It has discontinuous
derivatives at integer values of s, where the pieces join, because the set of points used in 
the interpolation changes discretely. 
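
To make the kernel picture concrete, here is a minimal sketch for the linear case only. The function names psi_linear and interp_kernel_sum are ours, chosen for illustration, and the sketch assumes equally spaced abscissas t_j = a + jΔ. It evaluates the first sum of equation (13.9.7); for a ≤ t ≤ b this already reproduces ordinary linear interpolation, since the endpoint kernels only matter outside [a, b].

```c
#include <math.h>

/* Interior kernel psi(s) for linear interpolation: a "hat" function that
   is 1 at s = 0 and falls linearly to 0 at s = -1 and s = +1. */
double psi_linear(double s)
{
    double as = fabs(s);
    return (as < 1.0) ? 1.0 - as : 0.0;
}

/* Evaluate sum_{j=0}^{m} h[j] * psi((t - t_j)/delta) with t_j = a + j*delta.
   For a <= t <= a + m*delta this equals linear interpolation of the h[j]. */
double interp_kernel_sum(const double h[], int m, double a, double delta,
                         double t)
{
    double sum = 0.0;
    int j;
    for (j = 0; j <= m; j++)
        sum += h[j] * psi_linear((t - (a + j * delta)) / delta);
    return sum;
}
```

With samples {1, 3, 2} at t = 0, 0.5, 1, the sum returns the sample value itself at t = 0.5 and the midpoint average at t = 0.25.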

As already remarked, the subintervals closest to a and b require different (noncentered) 
interpolation formulas. This is reflected in equation (13.9.7) by the second sum, with the 
special endpoint kernels φ_j(s). Actually, for reasons that will become clearer below, we have 
included all the points in the first sum (with kernel ψ), so the φ_j's are actually differences 
between true endpoint kernels and the interior kernel ψ. It is a tedious, but straightforward, 
exercise to write down all the φ_j(s)'s for any particular order of interpolation, each one 
consisting of differences of Lagrange interpolating polynomials spliced together piecewise. 

Now apply the integral operator ∫_a^b dt exp(iωt) to both sides of equation (13.9.7), 
interchange the sums and integral, and make the changes of variable s = (t − t_j)/Δ in the 







first sum, s = (t − a)/Δ in the second sum. The result is 


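$$ I \approx \Delta e^{i\omega a}\left\{ W(\theta)\sum_{j=0}^{M} h_j e^{ij\theta} + \sum_{j=\text{endpoints}} h_j\,\alpha_j(\theta) \right\} \qquad (13.9.8) $$

Here θ ≡ ωΔ, and the functions W(θ) and α_j(θ) are defined by 

$$ W(\theta) \equiv \int_{-\infty}^{\infty} ds\, e^{i\theta s}\,\psi(s) \qquad (13.9.9) $$

$$ \alpha_j(\theta) \equiv \int_{-\infty}^{\infty} ds\, e^{i\theta s}\,\varphi_j(s-j) \qquad (13.9.10) $$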


The key point is that equations (13.9.9) and (13.9.10) can be evaluated, analytically, 
once and for all, for any given interpolation scheme. Then equation (13.9.8) is an algorithm 
for applying “endpoint corrections” to a sum which (as we will see) can be done using the 
FFT, giving a result with high-order accuracy. 

We will consider only interpolations that are left-right symmetric. Then symmetry 
implies 

$$ \alpha_{M-j}(\theta) = e^{i\theta M}\,\alpha_j^*(\theta) = e^{i\omega(b-a)}\,\alpha_j^*(\theta) \qquad (13.9.11) $$

where * denotes complex conjugation. Also, ψ(s) = ψ(−s) implies that W(θ) is real. 

Turn now to the first sum in equation (13.9.8), which we want to do by FFT methods. 
To do so, choose some N that is an integer power of 2 with N ≥ M + 1. (Note that 
M need not be a power of two, so M = N − 1 is allowed.) If N > M + 1, define 
h_j ≡ 0, M + 1 ≤ j ≤ N − 1, i.e., “zero pad” the array of h_j's so that j takes on the range 
0 ≤ j ≤ N − 1. Then the sum can be done as a DFT for the special values ω = ω_n given by 

$$ \omega_n \Delta \equiv \frac{2\pi n}{N}, \qquad n = 0, 1, \ldots, \frac{N}{2} - 1 \qquad (13.9.12) $$

For fixed M, the larger N is chosen, the finer the sampling in frequency space. The 
value M, on the other hand, determines the highest frequency sampled, since Δ decreases 
with increasing M (equation 13.9.3), and the largest value of ωΔ is always just under π 
(equation 13.9.12). In general it is advantageous to oversample by at least a factor of 4, i.e., 
N > 4M (see below). We can now rewrite equation (13.9.8) in its final form as 
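
As a small illustration of how M and N set the frequency grid, the following sketch computes a padded FFT length and the frequencies ω_n of equation (13.9.12). The helper names padded_length and dft_frequency are hypothetical, and we assume Δ = (b − a)/M as used by dftint later in this section.

```c
/* Smallest power of two that is at least oversample*m and at least m+1
   (the FFT length N must satisfy N >= M + 1). */
int padded_length(int m, int oversample)
{
    int need = oversample * m, n = 1;
    if (need < m + 1) need = m + 1;
    while (n < need) n <<= 1;
    return n;
}

/* Angular frequency omega_n = 2*pi*n/(N*Delta), Delta = (b - a)/m,
   valid for n = 0, 1, ..., nfft/2 - 1. */
double dft_frequency(double a, double b, int m, int nfft, int n)
{
    const double pi = 3.14159265358979323846;
    double delta = (b - a) / m;
    return 2.0 * pi * n / (nfft * delta);
}
```

For M = 64 on [0, 1] with fourfold oversampling this gives N = 256, a frequency spacing of 2π/(NΔ) = π/2, and a largest sampled ωΔ of 2π(127)/256, just under π.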


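$$ I(\omega_n) = \Delta e^{i\omega_n a}\Big\{ W(\theta)\,[\mathrm{DFT}(h_0 \ldots h_{N-1})]_n + \alpha_0(\theta)h_0 + \alpha_1(\theta)h_1 + \alpha_2(\theta)h_2 + \alpha_3(\theta)h_3 + \cdots $$
$$ \qquad + e^{i\omega(b-a)}\big[\alpha_0^*(\theta)h_M + \alpha_1^*(\theta)h_{M-1} + \alpha_2^*(\theta)h_{M-2} + \alpha_3^*(\theta)h_{M-3} + \cdots\big]\Big\} \qquad (13.9.13) $$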

For cubic (or lower) polynomial interpolation, at most the terms explicitly shown above 
are nonzero; the ellipses (...) can therefore be ignored, and we need explicit forms only for 
the functions W, α_0, α_1, α_2, α_3, calculated with equations (13.9.9) and (13.9.10). We have 
worked these out for you, in the trapezoidal (second-order) and cubic (fourth-order) cases. 
Here are the results, along with the first few terms of their power series expansions for small θ: 



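Trapezoidal order: 

$$ W(\theta) = \frac{2(1-\cos\theta)}{\theta^2} \approx 1 - \frac{1}{12}\theta^2 + \frac{1}{360}\theta^4 - \frac{1}{20160}\theta^6 $$

$$ \alpha_0(\theta) = -\frac{1-\cos\theta}{\theta^2} + i\,\frac{\theta-\sin\theta}{\theta^2} \approx -\frac{1}{2} + \frac{1}{24}\theta^2 - \frac{1}{720}\theta^4 + \frac{1}{40320}\theta^6 + i\theta\left(\frac{1}{6} - \frac{1}{120}\theta^2 + \frac{1}{5040}\theta^4 - \frac{1}{362880}\theta^6\right) $$

$$ \alpha_1 = \alpha_2 = \alpha_3 = 0 $$

Cubic order: 

$$ W(\theta) = \left(\frac{6+\theta^2}{3\theta^4}\right)(3 - 4\cos\theta + \cos 2\theta) \approx 1 - \frac{11}{720}\theta^4 + \frac{23}{15120}\theta^6 $$

$$ \alpha_0(\theta) = \frac{(-42+5\theta^2) + (6+\theta^2)(8\cos\theta - \cos 2\theta)}{6\theta^4} + i\,\frac{(-12\theta+6\theta^3) + (6+\theta^2)\sin 2\theta}{6\theta^4} $$
$$ \qquad \approx -\frac{2}{3} + \frac{1}{45}\theta^2 + \frac{103}{15120}\theta^4 - \frac{169}{226800}\theta^6 + i\theta\left(\frac{2}{45} + \frac{2}{105}\theta^2 - \frac{8}{2835}\theta^4 + \frac{86}{467775}\theta^6\right) $$

$$ \alpha_1(\theta) = \frac{14(3-\theta^2) - 7(6+\theta^2)\cos\theta}{6\theta^4} + i\,\frac{30\theta - 5(6+\theta^2)\sin\theta}{6\theta^4} $$
$$ \qquad \approx \frac{7}{24} - \frac{7}{180}\theta^2 + \frac{5}{3456}\theta^4 - \frac{7}{259200}\theta^6 + i\theta\left(\frac{7}{72} - \frac{1}{168}\theta^2 + \frac{11}{72576}\theta^4 - \frac{13}{5987520}\theta^6\right) $$

$$ \alpha_2(\theta) = \frac{-4(3-\theta^2) + 2(6+\theta^2)\cos\theta}{3\theta^4} + i\,\frac{-12\theta + 2(6+\theta^2)\sin\theta}{3\theta^4} $$
$$ \qquad \approx -\frac{1}{6} + \frac{1}{45}\theta^2 - \frac{5}{6048}\theta^4 + \frac{1}{64800}\theta^6 + i\theta\left(-\frac{7}{90} + \frac{1}{210}\theta^2 - \frac{11}{90720}\theta^4 + \frac{13}{7484400}\theta^6\right) $$

$$ \alpha_3(\theta) = \frac{2(3-\theta^2) - (6+\theta^2)\cos\theta}{6\theta^4} + i\,\frac{6\theta - (6+\theta^2)\sin\theta}{6\theta^4} $$
$$ \qquad \approx \frac{1}{24} - \frac{1}{180}\theta^2 + \frac{5}{24192}\theta^4 - \frac{1}{259200}\theta^6 + i\theta\left(\frac{7}{360} - \frac{1}{840}\theta^2 + \frac{11}{362880}\theta^4 - \frac{13}{29937600}\theta^6\right) $$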


The program dftcor, below, implements the endpoint corrections for the cubic case. 
Given input values of ω, Δ, a, b, and an array with the eight values h_0, ..., h_3, h_{M−3}, ..., h_M, 
it returns the real and imaginary parts of the endpoint corrections in equation (13.9.13), and the 
factor W(θ). The code is turgid, but only because the formulas above are complicated. The 
formulas have cancellations to high powers of θ. It is therefore necessary to compute the 
right-hand sides in double precision, even when the corrections are desired only to single precision. 

It is also necessary to use the series expansion for small values of θ. The optimal cross-over 
value of θ depends on your machine’s wordlength, but you can always find it experimentally 
as the largest value where the two methods give identical results to machine precision. 

#include <math.h>

void dftcor(float w, float delta, float a, float b, float endpts[],
	float *corre, float *corim, float *corfac)
/* For an integral approximated by a discrete Fourier transform, this routine computes the
correction factor that multiplies the DFT and the endpoint correction to be added. Input is the
angular frequency w, stepsize delta, lower and upper limits of the integral a and b, while the
array endpts contains the first 4 and last 4 function values. The correction factor W(theta) is
returned as corfac, while the real and imaginary parts of the endpoint correction are returned
as corre and corim. */
{
	void nrerror(char error_text[]);
	float a0i,a0r,a1i,a1r,a2i,a2r,a3i,a3r,arg,c,cl,cr,s,sl,sr,t;
	float t2,t4,t6;
	double cth,ctth,spth2,sth,sth4i,stth,th,th2,th4,tmth2,tth4i;

	th=w*delta;
	if (a >= b || th < 0.0e0 || th > 3.1416e0) nrerror("bad arguments to dftcor");
	if (fabs(th) < 5.0e-2) {		/* Use series. */
		t=th;
		t2=t*t;
		t4=t2*t2;
		t6=t4*t2;
		*corfac=1.0-(11.0/720.0)*t4+(23.0/15120.0)*t6;
		a0r=(-2.0/3.0)+t2/45.0+(103.0/15120.0)*t4-(169.0/226800.0)*t6;
		a1r=(7.0/24.0)-(7.0/180.0)*t2+(5.0/3456.0)*t4-(7.0/259200.0)*t6;
		a2r=(-1.0/6.0)+t2/45.0-(5.0/6048.0)*t4+t6/64800.0;
		a3r=(1.0/24.0)-t2/180.0+(5.0/24192.0)*t4-t6/259200.0;
		a0i=t*(2.0/45.0+(2.0/105.0)*t2-(8.0/2835.0)*t4+(86.0/467775.0)*t6);
		a1i=t*(7.0/72.0-t2/168.0+(11.0/72576.0)*t4-(13.0/5987520.0)*t6);
		a2i=t*(-7.0/90.0+t2/210.0-(11.0/90720.0)*t4+(13.0/7484400.0)*t6);
		a3i=t*(7.0/360.0-t2/840.0+(11.0/362880.0)*t4-(13.0/29937600.0)*t6);
	} else {		/* Use trigonometric formulas in double precision. */
		cth=cos(th);
		sth=sin(th);
		ctth=cth*cth-sth*sth;
		stth=2.0e0*sth*cth;
		th2=th*th;
		th4=th2*th2;
		tmth2=3.0e0-th2;
		spth2=6.0e0+th2;
		sth4i=1.0/(6.0e0*th4);
		tth4i=2.0e0*sth4i;
		*corfac=tth4i*spth2*(3.0e0-4.0e0*cth+ctth);
		a0r=sth4i*(-42.0e0+5.0e0*th2+spth2*(8.0e0*cth-ctth));
		a0i=sth4i*(th*(-12.0e0+6.0e0*th2)+spth2*stth);
		a1r=sth4i*(14.0e0*tmth2-7.0e0*spth2*cth);
		a1i=sth4i*(30.0e0*th-5.0e0*spth2*sth);
		a2r=tth4i*(-4.0e0*tmth2+2.0e0*spth2*cth);
		a2i=tth4i*(-12.0e0*th+2.0e0*spth2*sth);
		a3r=sth4i*(2.0e0*tmth2-spth2*cth);
		a3i=sth4i*(6.0e0*th-spth2*sth);
	}
	cl=a0r*endpts[1]+a1r*endpts[2]+a2r*endpts[3]+a3r*endpts[4];
	sl=a0i*endpts[1]+a1i*endpts[2]+a2i*endpts[3]+a3i*endpts[4];
	cr=a0r*endpts[8]+a1r*endpts[7]+a2r*endpts[6]+a3r*endpts[5];
	sr = -a0i*endpts[8]-a1i*endpts[7]-a2i*endpts[6]-a3i*endpts[5];
	arg=w*(b-a);
	c=cos(arg);
	s=sin(arg);
	*corre=cl+c*cr-s*sr;
	*corim=sl+s*cr+c*sr;
}


Since the use of dftcor can be confusing, we also give an illustrative program dftint 
which uses dftcor to compute equation (13.9.1) for general a, b, ω, and h(t). Several points 
within this program bear mentioning: The parameters M and NDFT correspond to M and N 
in the above discussion. On successive calls, we recompute the Fourier transform only if 
a or b or h(t) has changed. 

Since dftint is designed to work for any value of ω satisfying ωΔ < π, not just the 
special values returned by the DFT (equation 13.9.12), we do polynomial interpolation of 
degree MPOL on the DFT spectrum. You should be warned that a large factor of oversampling 
(N ≫ M) is required for this interpolation to be accurate. After interpolation, we add the 
endpoint corrections from dftcor, which can be evaluated for any ω. 

While dftcor is good at what it does, dftint is illustrative only. It is not a general 
purpose program, because it does not adapt its parameters M, NDFT, MPOL, or its interpolation 
scheme, to any particular function h(t). You will have to experiment with your own application. 



#include <math.h>
#include "nrutil.h"
#define M 64
#define NDFT 1024
#define MPOL 6
#define TWOPI (2.0*3.14159265)
/* The values of M, NDFT, and MPOL are merely illustrative and should be optimized for your
particular application. M is the number of subintervals, NDFT is the length of the FFT (a power
of 2), and MPOL is the degree of polynomial interpolation used to obtain the desired frequency
from the FFT. */

void dftint(float (*func)(float), float a, float b, float w, float *cosint,
	float *sinint)
/* Example program illustrating how to use the routine dftcor. The user supplies an external
function func that returns the quantity h(t). The routine then returns the integral from a to b
of cos(wt)h(t) dt as cosint and the integral from a to b of sin(wt)h(t) dt as sinint. */
{
	void dftcor(float w, float delta, float a, float b, float endpts[],
		float *corre, float *corim, float *corfac);
	void polint(float xa[], float ya[], int n, float x, float *y, float *dy);
	void realft(float data[], unsigned long n, int isign);
	static int init=0;
	int j,nn;
	static float aold = -1.e30,bold = -1.e30,delta,(*funcold)(float);
	static float data[NDFT+1],endpts[9];
	float c,cdft,cerr,corfac,corim,corre,en,s;
	float sdft,serr,*cpol,*spol,*xpol;

	cpol=vector(1,MPOL);
	spol=vector(1,MPOL);
	xpol=vector(1,MPOL);
	if (init != 1 || a != aold || b != bold || func != funcold) {
		/* Do we need to initialize, or is only w changed? */
		init=1;
		aold=a;
		bold=b;
		funcold=func;
		delta=(b-a)/M;
		/* Load the function values into the data array. */
		for (j=1;j<=M+1;j++)
			data[j]=(*func)(a+(j-1)*delta);
		for (j=M+2;j<=NDFT;j++)		/* Zero pad the rest of the data array. */
			data[j]=0.0;
		for (j=1;j<=4;j++) {		/* Load the endpoints. */
			endpts[j]=data[j];
			endpts[j+4]=data[M-3+j];
		}
		realft(data,NDFT,1);
		/* realft returns the unused value corresponding to w_{N/2} in data[2]. We actually
		want this element to contain the imaginary part corresponding to w_0, which is zero. */
		data[2]=0.0;
	}
	/* Now interpolate on the DFT result for the desired frequency. If the frequency is an w_n,
	i.e., the quantity en is an integer, then cdft=data[2*en-1], sdft=data[2*en], and you
	could omit the interpolation. */
	en=w*delta*NDFT/TWOPI+1.0;
	nn=IMIN(IMAX((int)(en-0.5*MPOL+1.0),1),NDFT/2-MPOL+1);	/* Leftmost point for the interpolation. */
	for (j=1;j<=MPOL;j++,nn++) {
		cpol[j]=data[2*nn-1];
		spol[j]=data[2*nn];
		xpol[j]=nn;
	}
	polint(xpol,cpol,MPOL,en,&cdft,&cerr);
	polint(xpol,spol,MPOL,en,&sdft,&serr);
	dftcor(w,delta,a,b,endpts,&corre,&corim,&corfac);	/* Now get the endpoint correction */
	cdft *= corfac;						/* and the multiplicative factor W(theta). */
	sdft *= corfac;
	cdft += corre;
	sdft += corim;
	c=delta*cos(w*a);		/* Finally multiply by Delta and exp(iwa). */
	s=delta*sin(w*a);
	*cosint=c*cdft-s*sdft;
	*sinint=s*cdft+c*sdft;
	free_vector(cpol,1,MPOL);
	free_vector(spol,1,MPOL);
	free_vector(xpol,1,MPOL);
}


Sometimes one is interested only in the discrete frequencies ω_m of equation (13.9.5), 
the ones that have integral numbers of periods in the interval [a, b]. For smooth h(t), the 
value of I tends to be much smaller in magnitude at these ω's than at values in between, 
since the integral half-periods tend to cancel precisely. (That is why one must oversample for 
interpolation to be accurate: I(ω) is oscillatory with small magnitude near the ω_m's.) If you 
want these ω_m's without messy (and possibly inaccurate) interpolation, you have to set N to 
a multiple of M (compare equations 13.9.5 and 13.9.12). In the method implemented above, 
however, N must be at least M + 1, so the smallest such multiple is 2M, resulting in a factor 
~2 unnecessary computing. Alternatively, one can derive a formula like equation (13.9.13), 
but with the last sample function h_M = h(b) omitted from the DFT, but included entirely in 
the endpoint correction for h_M. Then one can set M = N (an integer power of 2) and get the 
special frequencies of equation (13.9.5) with no additional overhead. The modified formula is 

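$$ I(\omega_m) = \Delta e^{i\omega_m a}\Big\{ W(\theta)\,[\mathrm{DFT}(h_0 \ldots h_{M-1})]_m + \alpha_0(\theta)h_0 + \alpha_1(\theta)h_1 + \alpha_2(\theta)h_2 + \alpha_3(\theta)h_3 $$
$$ \qquad + e^{i\omega(b-a)}\big[A(\theta)h_M + \alpha_1^*(\theta)h_{M-1} + \alpha_2^*(\theta)h_{M-2} + \alpha_3^*(\theta)h_{M-3}\big]\Big\} \qquad (13.9.14) $$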


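where θ ≡ ω_m Δ and A(θ) is given by 

$$ A(\theta) = -\alpha_0(\theta) \qquad (13.9.15) $$

for the trapezoidal case, or 

$$ A(\theta) = \frac{(-6+11\theta^2) + (6+\theta^2)\cos 2\theta}{6\theta^4} - i\,\mathrm{Im}[\alpha_0(\theta)] \approx \frac{1}{3} + \frac{1}{45}\theta^2 - \frac{8}{945}\theta^4 + \frac{11}{14175}\theta^6 - i\,\mathrm{Im}[\alpha_0(\theta)] \qquad (13.9.16) $$

for the cubic case. 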

Factors like W(θ) arise naturally whenever one calculates Fourier coefficients of smooth 
functions, and they are sometimes called attenuation factors [1]. However, the endpoint 
corrections are equally important in obtaining accurate values of integrals. Narasimhan 
and Karthikeyan [2] have given a formula that is algebraically equivalent to our trapezoidal 
formula. However, their formula requires the evaluation of two FFTs, which is unnecessary. 
The basic idea used here goes back at least to Filon [3] in 1928 (before the FFT!). He used 
Simpson’s rule (quadratic interpolation). Since this interpolation is not left-right symmetric, 
two Fourier transforms are required. An alternative algorithm for equation (13.9.14) has been 
given by Lyness in [4]; for related references, see [5]. To our knowledge, the cubic-order 
formulas derived here have not previously appeared in the literature. 

Calculating Fourier transforms when the range of integration is (−∞, ∞) can be tricky. 
If the function falls off reasonably quickly at infinity, you can split the integral at a large 
enough value of t. For example, the integration to +∞ can be written 


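$$ \int_a^{\infty} e^{i\omega t}h(t)\,dt = \int_a^{b} e^{i\omega t}h(t)\,dt + \int_b^{\infty} e^{i\omega t}h(t)\,dt $$
$$ \qquad = \int_a^{b} e^{i\omega t}h(t)\,dt - \frac{h(b)e^{i\omega b}}{i\omega} + \frac{h'(b)e^{i\omega b}}{(i\omega)^2} - \cdots \qquad (13.9.17) $$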
The splitting point b must be chosen large enough that the remaining integral over (b, ∞) is 
small. Successive terms in its asymptotic expansion are found by integrating by parts. The 
integral over (a, b) can be done using dftint. You keep as many terms in the asymptotic 
expansion as you can easily compute. See [6] for some examples of this idea. More 
powerful methods, which work well for long-tailed functions but which do not use the FFT, 
are described in [7-9]. 

13.10 Wavelet Transforms 

Like the fast Fourier transform (FFT), the discrete wavelet transform (DWT) is 
a fast, linear operation that operates on a data vector whose length is an integer power 
of two, transforming it into a numerically different vector of the same length. Also 
like the FFT, the wavelet transform is invertible and in fact orthogonal — the inverse 
transform, when viewed as a big matrix, is simply the transpose of the transform. 
Both FFT and DWT, therefore, can be viewed as a rotation in function space, from 
the input space (or time) domain, where the basis functions are the unit vectors e_i, 
or Dirac delta functions in the continuum limit, to a different domain. For the FFT, 
this new domain has basis functions that are the familiar sines and cosines. In the 
wavelet domain, the basis functions are somewhat more complicated and have the 
fanciful names “mother functions” and “wavelets.” 

Of course there are an infinity of possible bases for function space, almost all of 
them uninteresting! What makes the wavelet basis interesting is that, unlike sines and 
cosines, individual wavelet functions are quite localized in space; simultaneously, 
like sines and cosines, individual wavelet functions are quite localized in frequency 
or (more precisely) characteristic scale. As we will see below, the particular kind 
of dual localization achieved by wavelets renders large classes of functions and 
operators sparse, or sparse to some high accuracy, when transformed into the wavelet 
domain. Analogously with the Fourier domain, where a class of computations, like 
convolutions, become computationally fast, there is a large class of computations 
(those that can take advantage of sparsity) that become computationally fast 
in the wavelet domain [1]. 

Unlike sines and cosines, which define a unique Fourier transform, there is 
not one single unique set of wavelets; in fact, there are infinitely many possible 
sets. Roughly, the different sets of wavelets make different trade-offs between 
how compactly they are localized in space and how smooth they are. (There are 
further fine distinctions.) 

Daubechies Wavelet Filter Coefficients 



A particular set of wavelets is specified by a particular set of numbers, called 
wavelet filter coefficients. Here, we will largely restrict ourselves to wavelet filters 
in a class discovered by Daubechies [2]. This class includes members ranging from 
highly localized to highly smooth. The simplest (and most localized) member, often 
called DAUB4, has only four coefficients, c_0, ..., c_3. For the moment we specialize 
to this case for ease of notation. 

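Consider the following transformation matrix acting on a column vector of 
data to its right: 

$$
\begin{bmatrix}
c_0 & c_1 & c_2 & c_3 & & & & & & \\
c_3 & -c_2 & c_1 & -c_0 & & & & & & \\
 & & c_0 & c_1 & c_2 & c_3 & & & & \\
 & & c_3 & -c_2 & c_1 & -c_0 & & & & \\
\vdots & & & & & \ddots & & & & \vdots \\
 & & & & & & c_0 & c_1 & c_2 & c_3 \\
 & & & & & & c_3 & -c_2 & c_1 & -c_0 \\
c_2 & c_3 & & & & & & & c_0 & c_1 \\
c_1 & -c_0 & & & & & & & c_3 & -c_2
\end{bmatrix}
\qquad (13.10.1)
$$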



Here blank entries signify zeroes. Note the structure of this matrix. The first row 
generates one component of the data convolved with the filter coefficients c_0, ..., c_3. 
Likewise the third, fifth, and other odd rows. If the even rows followed this pattern, 
offset by one, then the matrix would be a circulant, that is, an ordinary convolution 
that could be done by FFT methods. (Note how the last two rows wrap around 
like convolutions with periodic boundary conditions.) Instead of convolving with 
c_0, ..., c_3, however, the even rows perform a different convolution, with coefficients 
c_3, −c_2, c_1, −c_0. The action of the matrix, overall, is thus to perform two related 
convolutions, then to decimate each of them by half (throw away half the values), 
and interleave the remaining halves. 

It is useful to think of the filter c_0, ..., c_3 as being a smoothing filter, call it H, 
something like a moving average of four points. Then, because of the minus signs, 
the filter c_3, −c_2, c_1, −c_0, call it G, is not a smoothing filter. (In signal processing 
contexts, H and G are called quadrature mirror filters [3].) In fact, the c’s are chosen 
so as to make G yield, insofar as possible, a zero response to a sufficiently smooth 
data vector. This is done by requiring the sequence c_3, −c_2, c_1, −c_0 to have a certain 
number of vanishing moments. When this is the case for p moments (starting with 
the zeroth), a set of wavelets is said to satisfy an “approximation condition of order 
p.” This results in the output of H, decimated by half, accurately representing the 
data’s “smooth” information. The output of G, also decimated, is referred to as 
the data’s “detail” information [4]. 

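For such a characterization to be useful, it must be possible to reconstruct the 
original data vector of length N from its N/2 smooth or s-components and its N/2 
detail or d-components. That is effected by requiring the matrix (13.10.1) to be 
orthogonal, so that its inverse is just the transposed matrix 

$$
\begin{bmatrix}
c_0 & c_3 & & & \cdots & & & & c_2 & c_1 \\
c_1 & -c_2 & & & \cdots & & & & c_3 & -c_0 \\
c_2 & c_1 & c_0 & c_3 & & & & & & \\
c_3 & -c_0 & c_1 & -c_2 & & & & & & \\
 & & & & \ddots & & & & & \\
 & & & & & c_2 & c_1 & c_0 & c_3 & \\
 & & & & & c_3 & -c_0 & c_1 & -c_2 & \\
 & & & & & & & c_2 & c_1 & c_0 \;\; c_3 \\
 & & & & & & & c_3 & -c_0 & c_1 \;\; -c_2
\end{bmatrix}
\qquad (13.10.2)
$$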


One sees immediately that matrix (13.10.2) is inverse to matrix (13.10.1) if and 
only if these two equations hold, 

$$ c_0^2 + c_1^2 + c_2^2 + c_3^2 = 1 $$
$$ c_2 c_0 + c_3 c_1 = 0 \qquad (13.10.3) $$


If additionally we require the approximation condition of order p = 2, then two 
additional relations are required, 

$$ c_3 - c_2 + c_1 - c_0 = 0 $$
$$ 0c_3 - 1c_2 + 2c_1 - 3c_0 = 0 \qquad (13.10.4) $$


Equations (13.10.3) and (13.10.4) are 4 equations for the 4 unknowns c_0, ..., c_3, 
first recognized and solved by Daubechies. The unique solution (up to a left-right 
reversal) is 

$$ c_0 = (1+\sqrt{3})/4\sqrt{2} \qquad c_1 = (3+\sqrt{3})/4\sqrt{2} $$
$$ c_2 = (3-\sqrt{3})/4\sqrt{2} \qquad c_3 = (1-\sqrt{3})/4\sqrt{2} \qquad (13.10.5) $$
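
A quick numerical check of these coefficients is reassuring. The sketch below (function names are ours) evaluates the closed forms of equation (13.10.5) in double precision and returns the largest residual among the two orthogonality conditions (13.10.3) and the two moment conditions (13.10.4).

```c
#include <math.h>

/* Fill c[0..3] with the DAUB4 coefficients of equation (13.10.5). */
void daub4_coeffs(double c[4])
{
    double s3 = sqrt(3.0), d = 4.0 * sqrt(2.0);
    c[0] = (1.0 + s3) / d;
    c[1] = (3.0 + s3) / d;
    c[2] = (3.0 - s3) / d;
    c[3] = (1.0 - s3) / d;
}

/* Largest absolute residual of equations (13.10.3) and (13.10.4). */
double daub4_residual(const double c[4])
{
    double r[4], rmax = 0.0;
    int k;
    r[0] = c[0]*c[0] + c[1]*c[1] + c[2]*c[2] + c[3]*c[3] - 1.0;
    r[1] = c[2]*c[0] + c[3]*c[1];
    r[2] = c[3] - c[2] + c[1] - c[0];
    r[3] = 0.0*c[3] - 1.0*c[2] + 2.0*c[1] - 3.0*c[0];
    for (k = 0; k < 4; k++)
        if (fabs(r[k]) > rmax) rmax = fabs(r[k]);
    return rmax;
}
```

The residual should be at the level of roundoff, and the values should match the constants C0 through C3 used by the routine daub4 later in this section.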

In fact, DAUB4 is only the most compact of a sequence of wavelet sets: If we had 
six coefficients instead of four, there would be three orthogonality requirements in 
equation (13.10.3) (with offsets of zero, two, and four), and we could require the 
vanishing of p = 3 moments in equation (13.10.4). In this case, DAUB6, the solution 
coefficients can also be expressed in closed form, 

$$ c_0 = \left(1+\sqrt{10}+\sqrt{5+2\sqrt{10}}\right)/16\sqrt{2} \qquad c_1 = \left(5+\sqrt{10}+3\sqrt{5+2\sqrt{10}}\right)/16\sqrt{2} $$
$$ c_2 = \left(10-2\sqrt{10}+2\sqrt{5+2\sqrt{10}}\right)/16\sqrt{2} \qquad c_3 = \left(10-2\sqrt{10}-2\sqrt{5+2\sqrt{10}}\right)/16\sqrt{2} $$
$$ c_4 = \left(5+\sqrt{10}-3\sqrt{5+2\sqrt{10}}\right)/16\sqrt{2} \qquad c_5 = \left(1+\sqrt{10}-\sqrt{5+2\sqrt{10}}\right)/16\sqrt{2} \qquad (13.10.6) $$
For higher p, up to 10, Daubechies [2] has tabulated the coefficients numerically. The 
number of coefficients increases by two each time p is increased by one. 

Discrete Wavelet Transform 


We have not yet defined the discrete wavelet transform (DWT), but we are 
almost there: The DWT consists of applying a wavelet coefficient matrix like 
(13.10.1) hierarchically, first to the full data vector of length N, then to the “smooth” 
vector of length N/2, then to the “smooth-smooth” vector of length N/4, and 
so on until only a trivial number of “smooth-...-smooth” components (usually 2) 
remain. The procedure is sometimes called a pyramidal algorithm [4], for obvious 
reasons. The output of the DWT consists of these remaining components and all 
the “detail” components that were accumulated along the way. A diagram should 
make the procedure clear: 


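$$
\begin{bmatrix} y_1\\ y_2\\ y_3\\ y_4\\ y_5\\ y_6\\ y_7\\ y_8\\ y_9\\ y_{10}\\ y_{11}\\ y_{12}\\ y_{13}\\ y_{14}\\ y_{15}\\ y_{16} \end{bmatrix}
\xrightarrow{\;13.10.1\;}
\begin{bmatrix} s_1\\ d_1\\ s_2\\ d_2\\ s_3\\ d_3\\ s_4\\ d_4\\ s_5\\ d_5\\ s_6\\ d_6\\ s_7\\ d_7\\ s_8\\ d_8 \end{bmatrix}
\xrightarrow{\;\text{permute}\;}
\begin{bmatrix} s_1\\ s_2\\ s_3\\ s_4\\ s_5\\ s_6\\ s_7\\ s_8\\ d_1\\ d_2\\ d_3\\ d_4\\ d_5\\ d_6\\ d_7\\ d_8 \end{bmatrix}
\xrightarrow{\;13.10.1\;}
\begin{bmatrix} S_1\\ D_1\\ S_2\\ D_2\\ S_3\\ D_3\\ S_4\\ D_4\\ d_1\\ d_2\\ d_3\\ d_4\\ d_5\\ d_6\\ d_7\\ d_8 \end{bmatrix}
\xrightarrow{\;\text{permute}\;}
\begin{bmatrix} S_1\\ S_2\\ S_3\\ S_4\\ D_1\\ D_2\\ D_3\\ D_4\\ d_1\\ d_2\\ d_3\\ d_4\\ d_5\\ d_6\\ d_7\\ d_8 \end{bmatrix}
\xrightarrow{\;\text{etc.}\;}
\begin{bmatrix} \mathcal{S}_1\\ \mathcal{S}_2\\ \mathcal{D}_1\\ \mathcal{D}_2\\ D_1\\ D_2\\ D_3\\ D_4\\ d_1\\ d_2\\ d_3\\ d_4\\ d_5\\ d_6\\ d_7\\ d_8 \end{bmatrix}
\qquad (13.10.7)
$$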

If the length of the data vector were a higher power of two, there would be 
more stages of applying (13.10.1) (or any other wavelet coefficients) and permuting. 
The endpoint will always be a vector with two 𝒮’s and a hierarchy of 𝒟’s, D’s, 
d’s, etc. Notice that once d’s are generated, they simply propagate through to all 
subsequent stages. 

A value d_i of any level is termed a “wavelet coefficient” of the original data 
vector; the final values 𝒮_1, 𝒮_2 should strictly be called “mother-function coefficients,” 
although the term “wavelet coefficients” is often used loosely for both d’s and final 
𝒮’s. Since the full procedure is a composition of orthogonal linear operations, the 
whole DWT is itself an orthogonal linear operator. 

To invert the DWT, one simply reverses the procedure, starting with the smallest 
level of the hierarchy and working (in equation 13.10.7) from right to left. The 
inverse matrix (13.10.2) is of course used instead of the matrix (13.10.1). 

As already noted, the matrices (13.10.1) and (13.10.2) embody periodic (“wrap-around”) 
boundary conditions on the data vector. One normally accepts this as a 
minor inconvenience: the last few wavelet coefficients at each level of the hierarchy 
are affected by data from both ends of the data vector. By circularly shifting the 
matrix (13.10.1) N/2 columns to the left, one can symmetrize the wrap-around; 
but this does not eliminate it. It is in fact possible to eliminate the wrap-around 
completely by altering the coefficients in the first and last N rows of (13.10.1), 
giving an orthogonal matrix that is purely band-diagonal [5]. This variant, beyond 
our scope here, is useful when, e.g., the data varies by many orders of magnitude 
from one end of the data vector to the other. 

Here is a routine, wt1, that performs the pyramidal algorithm (or its inverse 
if isign is negative) on some data vector a[1..n]. Successive applications of 
the wavelet filter, and accompanying permutations, are done by an assumed routine 
wtstep, which must be provided. (We give examples of several different wtstep 
routines just below.) 

void wt1(float a[], unsigned long n, int isign,
	void (*wtstep)(float [], unsigned long, int))
/* One-dimensional discrete wavelet transform. This routine implements the pyramid algorithm,
replacing a[1..n] by its wavelet transform (for isign=1), or performing the inverse operation
(for isign=-1). Note that n MUST be an integer power of 2. The routine wtstep, whose
actual name must be supplied in calling this routine, is the underlying wavelet filter. Examples
of wtstep are daub4 and (preceded by pwtset) pwt. */
{
	unsigned long nn;

	if (n < 4) return;
	if (isign >= 0) {		/* Wavelet transform. */
		for (nn=n;nn>=4;nn>>=1) (*wtstep)(a,nn,isign);
		/* Start at largest hierarchy, and work towards smallest. */
	} else {			/* Inverse wavelet transform. */
		for (nn=4;nn<=n;nn<<=1) (*wtstep)(a,nn,isign);
		/* Start at smallest hierarchy, and work towards largest. */
	}
}


Here, as a specific instance of wtstep, is a routine for the DAUB4 wavelets: 


#include "nrutil.h"
#define C0 0.4829629131445341
#define C1 0.8365163037378079
#define C2 0.2241438680420134
#define C3 -0.1294095225512604

void daub4(float a[], unsigned long n, int isign)
/* Applies the Daubechies 4-coefficient wavelet filter to data vector a[1..n] (for isign=1) or
applies its transpose (for isign=-1). Used hierarchically by routines wt1 and wtn. */
{
	float *wksp;
	unsigned long nh,nh1,i,j;

	if (n < 4) return;
	wksp=vector(1,n);
	nh1=(nh=n >> 1)+1;
	if (isign >= 0) {		/* Apply filter. */
		for (i=1,j=1;j<=n-3;j+=2,i++) {
			wksp[i]=C0*a[j]+C1*a[j+1]+C2*a[j+2]+C3*a[j+3];
			wksp[i+nh] = C3*a[j]-C2*a[j+1]+C1*a[j+2]-C0*a[j+3];
		}
		wksp[i]=C0*a[n-1]+C1*a[n]+C2*a[1]+C3*a[2];
		wksp[i+nh] = C3*a[n-1]-C2*a[n]+C1*a[1]-C0*a[2];
	} else {			/* Apply transpose filter. */
		wksp[1]=C2*a[nh]+C1*a[n]+C0*a[1]+C3*a[nh1];
		wksp[2] = C3*a[nh]-C0*a[n]+C1*a[1]-C2*a[nh1];
		for (i=1,j=3;i<nh;i++) {
			wksp[j++]=C2*a[i]+C1*a[i+nh]+C0*a[i+1]+C3*a[i+nh1];
			wksp[j++] = C3*a[i]-C0*a[i+nh]+C1*a[i+1]-C2*a[i+nh1];
		}
	}
	for (i=1;i<=n;i++) a[i]=wksp[i];
	free_vector(wksp,1,n);
}


For larger sets of wavelet coefficients, the wrap-around of the last rows or 
columns is a programming inconvenience. An efficient implementation would 
handle the wrap-arounds as special cases, outside of the main loop. Here, we will 
content ourselves with a more general scheme involving some extra arithmetic at 
run time. The following routine sets up any particular wavelet coefficients whose 
values you happen to know. 

typedef struct {
	int ncof,ioff,joff;
	float *cc,*cr;
} wavefilt;

wavefilt wfilt;		/* Defining declaration of a structure. */

void pwtset(int n)
/* Initializing routine for pwt, here implementing the Daubechies wavelet filters with 4, 12, and
20 coefficients, as selected by the input value n. Further wavelet filters can be included in the
obvious manner. This routine must be called (once) before the first use of pwt. (For the case
n=4, the specific routine daub4 is considerably faster than pwt.) */
{
	void nrerror(char error_text[]);
	int k;
	float sig = -1.0;
	static float c4[5]={0.0,0.4829629131445341,0.8365163037378079,
		0.2241438680420134,-0.1294095225512604};
	static float c12[13]={0.0,0.111540743350, 0.494623890398, 0.751133908021,
		0.315250351709,-0.226264693965,-0.129766867567,
		0.097501605587, 0.027522865530,-0.031582039318,
		0.000553842201, 0.004777257511,-0.001077301085};
	static float c20[21]={0.0,0.026670057901, 0.188176800078, 0.527201188932,
		0.688459039454, 0.281172343661,-0.249846424327,
		-0.195946274377, 0.127369340336, 0.093057364604,
		-0.071394147166,-0.029457536822, 0.033212674059,
		0.003606553567,-0.010733175483, 0.001395351747,
		0.001992405295,-0.000685856695,-0.000116466855,
		0.000093588670,-0.000013264203};
	static float c4r[5],c12r[13],c20r[21];

	wfilt.ncof=n;
	if (n == 4) {
		wfilt.cc=c4;
		wfilt.cr=c4r;
	}
	else if (n == 12) {
		wfilt.cc=c12;
		wfilt.cr=c12r;
	}
	else if (n == 20) {
		wfilt.cc=c20;
		wfilt.cr=c20r;
	}
	else nrerror("unimplemented value n in pwtset");
	for (k=1;k<=n;k++) {
		wfilt.cr[wfilt.ncof+1-k]=sig*wfilt.cc[k];
		sig = -sig;
	}
	wfilt.ioff = wfilt.joff = -(n >> 1);
	/* These values center the "support" of the wavelets at each level. Alternatively, the "peaks"
	of the wavelets can be approximately centered by the choices ioff=-2 and joff=-n+2.
	Note that daub4 and pwtset with n=4 use different default centerings. */
}
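
The loop that fills cr in pwtset is the general recipe for the quadrature mirror filter: reverse the order of the coefficients and alternate the signs. The following stand-alone sketch (function names mirror_filter and filter_dot are ours) applies the same recipe, with the same unit-offset indexing, to an arbitrary coefficient array so that the pattern c_3, −c_2, c_1, −c_0 can be seen directly.

```c
/* Build the reversed, sign-alternated filter cr[1..ncof] from cc[1..ncof],
   following the loop in pwtset (index 0 of each array is unused). */
void mirror_filter(const float cc[], float cr[], int ncof)
{
    float sig = -1.0f;
    int k;
    for (k = 1; k <= ncof; k++) {
        cr[ncof + 1 - k] = sig * cc[k];
        sig = -sig;
    }
}

/* Dot product of the two filters over 1..ncof; it vanishes identically
   when ncof is even, whatever the coefficient values. */
double filter_dot(const float cc[], const float cr[], int ncof)
{
    double s = 0.0;
    int k;
    for (k = 1; k <= ncof; k++) s += (double)cc[k] * (double)cr[k];
    return s;
}
```

Feeding in the placeholder values 1, 2, 3, 4 gives cr = 4, −3, 2, −1, which is the pattern of the even rows of matrix (13.10.1).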


Once pwtset has been called, the following routine can be used as a specific 
instance of wtstep. 

#include "nrutil.h"

typedef struct {
	int ncof,ioff,joff;
	float *cc,*cr;
} wavefilt;

extern wavefilt wfilt;		/* Defined in pwtset. */

void pwt(float a[], unsigned long n, int isign)
/* Partial wavelet transform: applies an arbitrary wavelet filter to data vector a[1..n] (for
isign=1) or applies its transpose (for isign=-1). Used hierarchically by routines wt1 and wtn.
The actual filter is determined by a preceding (and required) call to pwtset, which initializes
the structure wfilt. */
{
	float ai,ai1,*wksp;
	unsigned long i,ii,j,jf,jr,k,n1,ni,nj,nh,nmod;

	if (n < 4) return;
	wksp=vector(1,n);
	nmod=wfilt.ncof*n;		/* A positive constant equal to zero mod n. */
	n1=n-1;				/* Mask of all bits, since n a power of 2. */
	nh=n >> 1;
	for (j=1;j<=n;j++) wksp[j]=0.0;
	if (isign >= 0) {		/* Apply filter. */
		for (ii=1,i=1;i<=n;i+=2,ii++) {
			ni=i+nmod+wfilt.ioff;	/* Pointer to be incremented and wrapped-around. */
			nj=i+nmod+wfilt.joff;
			for (k=1;k<=wfilt.ncof;k++) {
				jf=n1 & (ni+k);	/* We use bitwise and to wrap-around the pointers. */
				jr=n1 & (nj+k);
				wksp[ii] += wfilt.cc[k]*a[jf+1];
				wksp[ii+nh] += wfilt.cr[k]*a[jr+1];
			}
		}
	} else {			/* Apply transpose filter. */
		for (ii=1,i=1;i<=n;i+=2,ii++) {
			ai=a[ii];
			ai1=a[ii+nh];
			ni=i+nmod+wfilt.ioff;	/* See comments above. */
			nj=i+nmod+wfilt.joff;
			for (k=1;k<=wfilt.ncof;k++) {
				jf=(n1 & (ni+k))+1;
				jr=(n1 & (nj+k))+1;
				wksp[jf] += wfilt.cc[k]*ai;
				wksp[jr] += wfilt.cr[k]*ai1;
			}
		}
	}
	for (j=1;j<=n;j++) a[j]=wksp[j];	/* Copy the results back from workspace. */
	free_vector(wksp,1,n);
}
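
The index arithmetic in pwt deserves a second look, since it is where the periodic boundary conditions enter. The sketch below isolates it; the function names are ours and signed arithmetic is used for clarity. Because n is a power of two, masking with n − 1 is the same as reducing modulo n, and adding the multiple ncof·n beforehand keeps the argument positive even when the offset is negative.

```c
/* Wrap a nonnegative index into 0..n-1; valid only when n is a power of 2. */
unsigned long wrap_pow2(unsigned long idx, unsigned long n)
{
    return idx & (n - 1);
}

/* Unit-offset index of the data element read by pwt for the output pair
   starting at odd i (1, 3, 5, ...), filter tap k (1..ncof), and offset
   ioff (zero or negative). */
unsigned long pwt_source_index(unsigned long i, int k, int ioff, int ncof,
                               unsigned long n)
{
    long nmod = (long)ncof * (long)n;   /* equal to zero mod n, keeps sum positive */
    long ni = (long)i + nmod + ioff;
    return wrap_pow2((unsigned long)(ni + k), n) + 1;
}
```

For n = 8, a four-coefficient filter, and ioff = −2 (the value pwtset assigns when called with 4), the first output pair reads elements 1 through 4, while the last pair (i = 7) wraps around to read elements 7, 8, 1, 2.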

Figure 13.10.1. Wavelet functions, that is, single basis functions from the wavelet families DAUB4 
and DAUB20. A complete, orthonormal wavelet basis consists of scalings and translations of either one 
of these functions. DAUB4 has an infinite number of cusps; DAUB20 would show similar behavior 
in a higher derivative. 


What Do Wavelets Look Like? 

We are now in a position actually to see some wavelets. To do so, we simply 
run unit vectors through any of the above discrete wavelet transforms, with isign 
negative so that the inverse transform is performed. Figure 13.10.1 shows the 
DAUB4 wavelet that is the inverse DWT of a unit vector in the 5th component of a 
vector of length 1024, and also the DAUB20 wavelet that is the inverse of the 22nd 
component. (One needs to go to a later hierarchical level for DAUB20, to avoid a 
wavelet with a wrapped-around tail.) Other unit vectors would give wavelets with 
the same shapes, but different positions and scales. 
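
The recipe just described can be sketched in a few lines. What follows is a minimal, self-contained illustration and not the book's daub4 routine: it uses zero-based arrays and double precision, and the names daub4_coeffs, daub4_inverse_step, and daub4_wavelet are hypothetical. It assumes the standard DAUB4 coefficients of equation (13.10.5) and periodic wrap-around.

```c
#include <assert.h>
#include <math.h>
#include <stdlib.h>
#include <string.h>

/* DAUB4 filter coefficients (equation 13.10.5 of the text). */
static void daub4_coeffs(double c[4])
{
    const double s3 = sqrt(3.0), norm = 4.0 * sqrt(2.0);
    c[0] = (1.0 + s3) / norm;
    c[1] = (3.0 + s3) / norm;
    c[2] = (3.0 - s3) / norm;
    c[3] = (1.0 - s3) / norm;
}

/* One transpose (inverse) DAUB4 step on the first n entries of a[],
   zero-based, with periodic wrap-around. n must be even and >= 4. */
static void daub4_inverse_step(double a[], size_t n)
{
    double c[4];
    size_t nh = n / 2, i;
    double *w = calloc(n, sizeof *w);
    assert(w != NULL && n >= 4 && n % 2 == 0);
    daub4_coeffs(c);
    for (i = 0; i < nh; i++) {
        double s = a[i], d = a[i + nh];   /* smooth and detail inputs */
        size_t j = 2 * i;
        w[j]           += c[0] * s + c[3] * d;
        w[j + 1]       += c[1] * s - c[2] * d;
        w[(j + 2) % n] += c[2] * s + c[1] * d;
        w[(j + 3) % n] += c[3] * s - c[0] * d;
    }
    memcpy(a, w, n * sizeof *w);
    free(w);
}

/* Fill out[0..n-1] with the wavelet obtained from the unit vector e_k
   (k is one-based, as in the text). n must be a power of 2, n >= 4. */
void daub4_wavelet(double out[], size_t n, size_t k)
{
    size_t nn;
    assert(n >= 4 && (n & (n - 1)) == 0 && k >= 1 && k <= n);
    memset(out, 0, n * sizeof *out);
    out[k - 1] = 1.0;
    for (nn = 4; nn <= n; nn <<= 1) daub4_inverse_step(out, nn);
}
```

Calling daub4_wavelet(v, 1024, 5) should fill v with samples of the DAUB4 wavelet discussed above; because the transform is orthogonal, the result has unit sum of squares.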

One sees that both DAUB4 and DAUB20 have wavelets that are continuous. 
DAUB20 wavelets also have higher continuous derivatives. DAUB4 has the peculiar 
property that its derivative exists only almost everywhere. Examples of where it fails 
to exist are the points $p/2^n$, where p and n are integers; at such points, DAUB4 is
left differentiable, but not right differentiable! This kind of discontinuity — at least 
in some derivative — is a necessary feature of wavelets with compact support, like 
the Daubechies series. For every increase in the number of wavelet coefficients by 
two, the Daubechies wavelets gain about half a derivative of continuity. (But not 
exactly half; the actual orders of regularity are irrational numbers!) 

Figure 13.10.2. More wavelets, here generated from the sum of two unit vectors, $e_{10} + e_{58}$, which
are in different hierarchical levels of scale, and also at different spatial positions. DAUB4 wavelets (a)
are defined by a filter in coordinate space (equation 13.10.5), while Lemarie wavelets (b) are defined by
a filter most easily written in Fourier space (equation 13.10.14).

Note that the fact that wavelets are not smooth does not prevent their having
exact representations for some smooth functions, as demanded by their approximation
order p. The continuity of a wavelet is not the same as the continuity of functions
that a set of wavelets can represent. For example, DAUB4 can represent (piecewise)
linear functions of arbitrary slope: in the correct linear combinations, the cusps all
cancel out. Every increase of two in the number of coefficients allows one higher
order of polynomial to be exactly represented.
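
The statement about linear functions can be checked directly on the filter. The sketch below is only an illustration: the function name daub4_detail is hypothetical, and it assumes the standard DAUB4 coefficients and the detail row c3, -c2, c1, -c0 of the transform matrix. A vanishing detail coefficient means that the data are carried entirely by the smooth part.

```c
#include <math.h>

/* Apply the DAUB4 detail (wavelet) filter c3, -c2, c1, -c0 to four
   consecutive samples f[0..3] and return the resulting coefficient. */
double daub4_detail(const double f[4])
{
    const double s3 = sqrt(3.0), norm = 4.0 * sqrt(2.0);
    const double c0 = (1.0 + s3) / norm, c1 = (3.0 + s3) / norm;
    const double c2 = (3.0 - s3) / norm, c3 = (1.0 - s3) / norm;
    return c3 * f[0] - c2 * f[1] + c1 * f[2] - c0 * f[3];
}
```

Samples of any straight line give a detail coefficient of zero (to roundoff), while samples of a parabola do not, consistent with approximation order p = 2.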

Figure 13.10.2 shows the result of performing the inverse DWT on the input
vector $e_{10} + e_{58}$, again for the two different particular wavelets. Since 10 lies early
in the hierarchical range of 9-16, that wavelet lies on the left side of the picture.
Since 58 lies in a later (smaller-scale) hierarchy, it is a narrower wavelet; in the range 
of 33-64 it is towards the end, so it lies on the right side of the picture. Note that 
smaller-scale wavelets are taller, so as to have the same squared integral. 

Wavelet Filters in the Fourier Domain 

The Fourier transform of a set of filter coefficients $c_j$ is given by

$$H(\omega) = \sum_j c_j e^{ij\omega} \qquad (13.10.8)$$

Here H is a function periodic in $2\pi$, and it has the same meaning as before: It is
the wavelet filter, now written in the Fourier domain. A very useful fact is that the
orthogonality conditions for the c's (e.g., equation 13.10.3 above) collapse to two
simple relations in the Fourier domain,



$$\tfrac{1}{2}|H(0)|^2 = 1 \qquad (13.10.9)$$

and

$$\tfrac{1}{2}\left[|H(\omega)|^2 + |H(\omega + \pi)|^2\right] = 1 \qquad (13.10.10)$$

Likewise the approximation condition of order p (e.g., equation 13.10.4 above)
has a simple formulation, requiring that $H(\omega)$ have a pth order zero at $\omega = \pi$,
or (equivalently)

$$H^{(m)}(\pi) = 0 \qquad m = 0, 1, \ldots, p-1 \qquad (13.10.11)$$

It is thus relatively straightforward to invent wavelet sets in the Fourier domain.
You simply invent a function $H(\omega)$ satisfying equations (13.10.9)-(13.10.11). To
find the actual $c_j$'s applicable to a data (or s-component) vector of length N, and
with periodic wrap-around as in matrices (13.10.1) and (13.10.2), you invert equation
(13.10.8) by the discrete Fourier transform

$$c_j = \frac{1}{N}\sum_{k=0}^{N-1} H\!\left(\frac{2\pi k}{N}\right) e^{-2\pi i jk/N} \qquad (13.10.12)$$

The quadrature mirror filter G (reversed $c_j$'s with alternating signs), incidentally,
has the Fourier representation

$$G(\omega) = e^{-i\omega} H^*(\omega + \pi) \qquad (13.10.13)$$

where asterisk denotes complex conjugation.

In general the above procedure will not produce wavelet filters with compact
support. In other words, all N of the $c_j$, $j = 0, \ldots, N-1$, will in general be
nonzero (though they may be rapidly decreasing in magnitude). The Daubechies
wavelets, or other wavelets with compact support, are specially chosen so that $H(\omega)$
is a trigonometric polynomial with only a small number of Fourier components,
guaranteeing that there will be only a small number of nonzero $c_j$'s.

On the other hand, there is sometimes no particular reason to demand compact 
support. Giving it up in fact allows the ready construction of relatively smoother 
wavelets (higher values of p). Even without compact support, the convolutions 
implicit in the matrix (13.10.1) can be done efficiently by FFT methods. 

Lemarie's wavelet (see [4]) has p = 4, does not have compact support, and is
defined by the choice of $H(\omega)$,

$$H(\omega) = \left[ 2(1-u)^4 \, \frac{315 - 420u + 126u^2 - 4u^3}{315 - 420v + 126v^2 - 4v^3} \right]^{1/2} \qquad (13.10.14)$$

where

$$u = \sin^2\frac{\omega}{2} \qquad v = \sin^2\omega \qquad (13.10.15)$$

It is beyond our scope to explain where equation (13.10.14) comes from. An
informal description is that the quadrature mirror filter $G(\omega)$ deriving from equation
(13.10.14) has the property that it gives identically zero when applied to any function
whose odd-numbered samples are equal to the cubic spline interpolation of its
even-numbered samples. Since this class of functions includes many very smooth
members, it follows that $H(\omega)$ does a good job of truly selecting a function's smooth
information content. Sample Lemarie wavelets are shown in Figure 13.10.2.
Truncated Wavelet Approximations

Most of the usefulness of wavelets rests on the fact that wavelet transforms 
can usefully be severely truncated, that is, turned into sparse expansions. The case 
of Fourier transforms is different: FFTs are ordinarily used without truncation, 
to compute fast convolutions, for example. This works because the convolution 
operator is particularly simple in the Fourier basis. There are not, however, any 
standard mathematical operations that are especially simple in the wavelet basis. 

To see how truncation works, consider the simple example shown in Figure 
13.10.3. The upper panel shows an arbitrarily chosen test function, smooth except 
for a square-root cusp, sampled onto a vector of length $2^{10}$. The bottom panel (solid
curve) shows, on a log scale, the absolute value of the vector's components after it has
been run through the DAUB4 discrete wavelet transform. One notes, from right to
left, the different levels of hierarchy, 513-1024, 257-512, 129-256, etc. Within each
level, the wavelet coefficients are non-negligible only very near the location of the 
cusp, or very near the left and right boundaries of the hierarchical range (edge effects). 

The dotted curve in the lower panel of Figure 13.10.3 plots the same amplitudes 
as the solid curve, but sorted into decreasing order of size. One can read off, for 
example, that the 130th largest wavelet coefficient has an amplitude less than $10^{-5}$
of the largest coefficient, whose magnitude is about 10 (power or square integral ratio
less than $10^{-10}$). Thus, the example function can be represented quite accurately
by only 130, rather than 1024, coefficients — the remaining ones being set to 
zero. Note that this kind of truncation makes the vector sparse, but not shorter 
than 1024. It is very important that vectors in wavelet space be truncated according 
to the amplitude of the components, not their position in the vector. Keeping the 
first 256 components of the vector (all levels of the hierarchy except the last two) 
would give an extremely poor, and jagged, approximation to the function. When
you compress a function with wavelets, you have to record both the values and the 
positions of the nonzero coefficients. 
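
A minimal sketch of such a record, assuming nothing beyond the description above: each retained coefficient is stored as a (position, value) pair. The type sparse_coef and the two function names are hypothetical, and indices are zero-based.

```c
#include <math.h>
#include <stddef.h>

/* One retained wavelet coefficient: its position and its value. */
typedef struct {
    size_t index;
    double value;
} sparse_coef;

/* Copy every component of w[0..n-1] with |w[i]| >= thresh into out[]
   (which must have room for n entries) and return how many were kept. */
size_t wavelet_truncate(const double w[], size_t n, double thresh,
                        sparse_coef out[])
{
    size_t i, kept = 0;
    for (i = 0; i < n; i++) {
        if (fabs(w[i]) >= thresh) {
            out[kept].index = i;
            out[kept].value = w[i];
            kept++;
        }
    }
    return kept;
}

/* Rebuild the full-length (sparse) vector from the retained entries. */
void wavelet_expand(const sparse_coef in[], size_t kept,
                    double w[], size_t n)
{
    size_t i;
    for (i = 0; i < n; i++) w[i] = 0.0;
    for (i = 0; i < kept; i++) w[in[i].index] = in[i].value;
}
```

The expanded vector has the original length; it is then passed through the inverse wavelet transform to obtain the approximation.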

Generally, compact (and therefore unsmooth) wavelets are better for lower 
accuracy approximation and for functions with discontinuities (like edges), while 
smooth (and therefore noncompact) wavelets are better for achieving high numerical 
accuracy. This makes compact wavelets a good choice for image compression, for 
example, while it makes smooth wavelets best for fast solution of integral equations. 

Wavelet Transform in Multidimensions 

A wavelet transform of a d-dimensional array is most easily obtained by 
transforming the array sequentially on its first index (for all values of its other indices), 
then on its second, and so on. Each transformation corresponds to multiplication 
by an orthogonal matrix. By matrix associativity, the result is independent of the 
order in which the indices were transformed. The situation is exactly like that for 
multidimensional FFTs. A routine for effecting the multidimensional DWT can thus 
be modeled on a multidimensional FFT routine like fourn:

#include "nrutil.h"

void wtn(float a[], unsigned long nn[], int ndim, int isign,
    void (*wtstep)(float [], unsigned long, int))
/* Replaces a by its ndim-dimensional discrete wavelet transform, if isign is input as 1. Here
nn[1..ndim] is an integer array containing the lengths of each dimension (number of real
values), which MUST all be powers of 2. a is a real array of length equal to the product of
these lengths, in which the data are stored as in a multidimensional real array. If isign is
input as -1, a is replaced by its inverse wavelet transform. The routine wtstep, whose actual
name must be supplied in calling this routine, is the underlying wavelet filter. Examples of
wtstep are daub4 and (preceded by pwtset) pwt. */
{
    unsigned long i1,i2,i3,k,n,nnew,nprev=1,nt,ntot=1;
    int idim;
    float *wksp;

    for (idim=1;idim<=ndim;idim++) ntot *= nn[idim];
    wksp=vector(1,ntot);
    for (idim=1;idim<=ndim;idim++) {       /* Main loop over the dimensions. */
        n=nn[idim];
        nnew=n*nprev;
        if (n > 4) {
            for (i2=0;i2<ntot;i2+=nnew) {
                for (i1=1;i1<=nprev;i1++) {
                    /* Copy the relevant row or column or etc. into workspace. */
                    for (i3=i1+i2,k=1;k<=n;k++,i3+=nprev) wksp[k]=a[i3];
                    if (isign >= 0) {      /* Do one-dimensional wavelet transform. */
                        for (nt=n;nt>=4;nt >>= 1)
                            (*wtstep)(wksp,nt,isign);
                    } else {               /* Or inverse transform. */
                        for (nt=4;nt<=n;nt <<= 1)
                            (*wtstep)(wksp,nt,isign);
                    }
                    /* Copy back from workspace. */
                    for (i3=i1+i2,k=1;k<=n;k++,i3+=nprev) a[i3]=wksp[k];
                }
            }
        }
        nprev=nnew;
    }
    free_vector(wksp,1,ntot);
}


Here, as before, wtstep is an individual wavelet step, either daub4 or pwt. 

Compression of Images 

An immediate application of the multidimensional transform wtn is to image 
compression. The overall procedure is to take the wavelet transform of a digitized 
image, and then to “allocate bits” among the wavelet coefficients in some highly 
nonuniform, optimized, manner. In general, large wavelet coefficients get quantized 
accurately, while small coefficients are quantized coarsely with only a bit or two 
— or else are truncated completely. If the resulting quantization levels are still 
statistically nonuniform, they may then be further compressed by a technique like 
Huffman coding (§20.4). 

While a more detailed description of the “back end” of this process, namely the 
quantization and coding of the image, is beyond our scope, it is quite straightforward 
to demonstrate the “front-end” wavelet encoding with a simple truncation: We keep 
(with full accuracy) all wavelet coefficients larger than some threshold, and we delete 
(set to zero) all smaller wavelet coefficients. We can then adjust the threshold to 
vary the fraction of preserved coefficients. 

Figure 13.10.4 shows a sequence of images that differ in the number of wavelet 
coefficients that have been kept. The original picture (a), which is an official IEEE 
test image, has 256 by 256 pixels with an 8-bit grayscale. The two reproductions 
following are reconstructed with 23% (b) and 5.5% (c) of the 65536 wavelet 
coefficients. The latter image illustrates the kind of compromises made by the 
truncated wavelet representation. High-contrast edges (the model’s right cheek and 
hair highlights, e.g.) are maintained at a relatively high resolution, while low-contrast 
areas (the model’s left eye and cheek, e.g.) are washed out into what amounts to 
large constant pixels. Figure 13.10.4 (d) is the result of performing the identical 
procedure with Fourier, instead of wavelet, transforms: The figure is reconstructed 
from the 5.5% of 65536 real Fourier components having the largest magnitudes. 
One sees that, since sines and cosines are nonlocal, the resolution is uniformly poor 
across the picture; also, the deletion of any components produces a mottled “ringing” 
everywhere. (Practical Fourier image compression schemes therefore break up an 
image into small blocks of pixels, 16 x 16, say, and do rather elaborate smoothing 
across block boundaries when the image is reconstructed.) 

Fast Solution of Linear Systems 

One of the most interesting, and promising, wavelet applications is linear 
algebra. The basic idea [1] is to think of an integral operator (that is, a large matrix) as
a digital image. Suppose that the operator compresses well under a two-dimensional 
wavelet transform, i.e., that a large fraction of its wavelet coefficients are so small 
as to be negligible. Then any linear system involving the operator becomes a sparse 
system in the wavelet basis. In other words, to solve 



$$A \cdot x = b \qquad (13.10.16)$$

we first wavelet-transform the operator A and the right-hand side b by

$$\tilde{A} = W \cdot A \cdot W^T, \qquad \tilde{b} = W \cdot b \qquad (13.10.17)$$

where W represents the one-dimensional wavelet transform, then solve

$$\tilde{A} \cdot \tilde{x} = \tilde{b} \qquad (13.10.18)$$

and finally transform to the answer by the inverse wavelet transform

$$x = W^T \cdot \tilde{x} \qquad (13.10.19)$$

(Note that the routine wtn does the complete transformation of A into $\tilde{A}$.)

Figure 13.10.4. (a) IEEE test image, 256 x 256 pixels with 8-bit grayscale. (b) The image is transformed
into the wavelet basis; 77% of its wavelet components are set to zero (those of smallest magnitude); it
is then reconstructed from the remaining 23%. (c) Same as (b), but 94.5% of the wavelet components
are deleted. (d) Same as (c), but the Fourier transform is used instead of the wavelet transform. Wavelet
coefficients are better than the Fourier coefficients at preserving relevant details.
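
That equations (13.10.17)-(13.10.19) recover the solution of the original system (13.10.16) follows from the orthogonality of the wavelet transform alone. A short check, assuming only $W^T \cdot W = 1$:

```latex
\begin{aligned}
\tilde{A}\cdot\tilde{x} = \tilde{b}
  &\;\Longrightarrow\; W \cdot A \cdot W^T \cdot \tilde{x} = W \cdot b \\
  &\;\Longrightarrow\; W^T \cdot W \cdot A \cdot (W^T \cdot \tilde{x}) = W^T \cdot W \cdot b \\
  &\;\Longrightarrow\; A \cdot (W^T \cdot \tilde{x}) = b
\end{aligned}
```

so that $x = W^T \cdot \tilde{x}$ indeed satisfies $A \cdot x = b$. Truncating small elements of the transformed matrix changes the solved system slightly, so the recovered x is then approximate to roughly the truncation level.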
Figure 13.10.5. Wavelet transform of a 256 x 256 matrix, represented graphically. The original matrix
has a discontinuous cusp along its diagonal, decaying smoothly away on both sides of the diagonal. In
wavelet basis, the matrix becomes sparse: Components larger than $10^{-3}$ are shown as black, components
larger than $10^{-6}$ as gray, and smaller-magnitude components are white. The matrix indices i and j
number from the lower left.


A typical integral operator that compresses well into wavelets has arbitrary (or 
even nearly singular) elements near to its main diagonal, but becomes smooth away 
from the diagonal. An example might be 


$$A_{ij} = \begin{cases} -1 & \text{if } i = j \\ |i - j|^{-1/2} & \text{otherwise} \end{cases} \qquad (13.10.20)$$


Figure 13.10.5 shows a graphical representation of the wavelet transform of this 
matrix, where i and j range over 1...256, using the DAUB12 wavelets. Elements
larger in magnitude than $10^{-3}$ times the maximum element are shown as black
pixels, while elements between $10^{-3}$ and $10^{-6}$ are shown in gray. White pixels are
$< 10^{-6}$. The indices i and j each number from the lower left.

In the figure, one sees the hierarchical decomposition into power-of-two sized 
blocks. At the edges or corners of the various blocks, one sees edge effects caused 
by the wrap-around wavelet boundary conditions. Apart from edge effects, within 
each block, the nonnegligible elements are concentrated along the block diagonals. 
This is a statement that, for this type of linear operator, a wavelet is coupled mainly 
to near neighbors in its own hierarchy (square blocks along the main diagonal) and 
near neighbors in other hierarchies (rectangular blocks off the diagonal). 

The number of nonnegligible elements in a matrix like that in Figure 13.10.5
scales only as N, the linear size of the matrix; as a rough rule of thumb it is about
$10 N \log_{10}(1/\epsilon)$, where $\epsilon$ is the truncation level, e.g., $10^{-6}$. For a 2000 by 2000
matrix, then, the matrix is sparse by a factor on the order of 30.
Various numerical schemes can be used to solve sparse linear systems of this
"hierarchically band diagonal" form. Beylkin, Coifman, and Rokhlin [1] make
the interesting observations that (1) the product of two such matrices is itself
hierarchically band diagonal (truncating, of course, newly generated elements that
are smaller than the predetermined threshold $\epsilon$); and moreover that (2) the product
can be formed in order N operations.

Fast matrix multiplication makes it possible to find the matrix inverse by 
Schultz’s (or Hotelling’s) method, see §2.5. 
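
For orientation, the iteration in question is $B_{n+1} = 2B_n - B_n \cdot A \cdot B_n$, which needs nothing but matrix products. The sketch below shows it for a small dense matrix; it is an illustration only. In the wavelet setting each product would be a sparse, thresholded product, which is not shown here. The starting guess $B_0 = A^T / (\|A\|_1 \|A\|_\infty)$ is a common choice that we assume for the example, and the function names are hypothetical.

```c
#include <math.h>
#include <stdlib.h>

/* c = a*b for n-by-n row-major matrices. */
static void matmul(const double *a, const double *b, double *c, int n)
{
    int i, j, k;
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++) {
            double s = 0.0;
            for (k = 0; k < n; k++) s += a[i * n + k] * b[k * n + j];
            c[i * n + j] = s;
        }
}

/* Approximate inverse of the n-by-n row-major matrix a, written to b,
   by the iteration B <- 2B - B*A*B. Returns 0 on success, -1 if the
   workspace cannot be allocated. */
int schultz_inverse(const double *a, double *b, int n, int iters)
{
    int i, j, it;
    double norm1 = 0.0, norminf = 0.0;
    double *ab = malloc((size_t)n * n * sizeof *ab);
    double *bab = malloc((size_t)n * n * sizeof *bab);
    if (ab == NULL || bab == NULL) { free(ab); free(bab); return -1; }
    for (i = 0; i < n; i++) {          /* max row sum and max column sum */
        double r = 0.0, c = 0.0;
        for (j = 0; j < n; j++) {
            r += fabs(a[i * n + j]);
            c += fabs(a[j * n + i]);
        }
        if (r > norminf) norminf = r;
        if (c > norm1) norm1 = c;
    }
    for (i = 0; i < n; i++)            /* B0 = A^T / (|A|_1 |A|_inf) */
        for (j = 0; j < n; j++)
            b[i * n + j] = a[j * n + i] / (norm1 * norminf);
    for (it = 0; it < iters; it++) {
        matmul(a, b, ab, n);           /* A*B   */
        matmul(b, ab, bab, n);         /* B*A*B */
        for (i = 0; i < n * n; i++) b[i] = 2.0 * b[i] - bab[i];
    }
    free(ab);
    free(bab);
    return 0;
}
```

Convergence is quadratic once the iteration gets going, so a few tens of iterations are ample for a well-conditioned matrix.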

Other schemes are also possible for fast solution of hierarchically band diagonal 
forms. For example, one can use the conjugate gradient method, implemented in 
§2.7 as linbcg. 

CITED REFERENCES AND FURTHER READING: 

Daubechies, I. 1992, Wavelets (Philadelphia: S.I.A.M.). 

Strang, G. 1989, SIAM Review, vol. 31, pp. 614-627. 

Beylkin, G., Coifman, R., and Rokhlin, V. 1991, Communications on Pure and Applied Mathematics,
vol. 44, pp. 141-183. [1]

Daubechies, I. 1988, Communications on Pure and Applied Mathematics, vol. 41, pp. 909-996. [2]

Vaidyanathan, P.P. 1990, Proceedings of the IEEE, vol. 78, pp. 56-93. [3] 

Mallat, S.G. 1989, IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 11, 
pp. 674-693. [4] 

Freedman, M.H., and Press, W.H. 1992, preprint. [5] 


13.11 Numerical Use of the Sampling Theorem 


In §6.10 we implemented an approximating formula for Dawson’s integral due to 
Rybicki. Now that we have become Fourier sophisticates, we can learn that the formula 
derives from numerical application of the sampling theorem (§12.1), normally considered to 
be a purely analytic tool. Our discussion is identical to Rybicki [1].

For present purposes, the sampling theorem is most conveniently stated as follows:
Consider an arbitrary function g(t) and the grid of sampling points $t_n = \alpha + nh$, where n
ranges over the integers and $\alpha$ is a constant that allows an arbitrary shift of the sampling
grid. We then write

$$g(t) = \sum_{n=-\infty}^{\infty} g(t_n)\,\mathrm{sinc}\left[\frac{\pi}{h}(t - t_n)\right] + e(t) \qquad (13.11.1)$$

where sinc x = sin x/x. The summation over the sampling points is called the sampling
representation of g(t), and e(t) is its error term. The sampling theorem asserts that the
sampling representation is exact, that is, e(t) = 0, if the Fourier transform of g(t),

$$G(\omega) = \int_{-\infty}^{\infty} g(t)\,e^{i\omega t}\,dt \qquad (13.11.2)$$

vanishes identically for $|\omega| > \pi/h$.

When can sampling representations be used to advantage for the approximate numerical
computation of functions? In order that the error term be small, the Fourier transform $G(\omega)$
must be sufficiently small for $|\omega| > \pi/h$. On the other hand, in order for the summation
in (13.11.1) to be approximated by a reasonably small number of terms, the function g(t)
itself should be very small outside of a fairly limited range of values of t. Thus we are
led to two conditions to be satisfied in order that (13.11.1) be useful numerically: Both the
function g(t) and its Fourier transform $G(\omega)$ must rapidly approach zero for large values
of their respective arguments.

Unfortunately, these two conditions are mutually antagonistic: this is the Uncertainty
Principle in quantum mechanics. There exist strict limits on how rapidly the simultaneous approach
to zero can be in both arguments. According to a theorem of Hardy [2], if $g(t) = O(e^{-t^2})$
as $|t| \to \infty$ and $G(\omega) = O(e^{-\omega^2/4})$ as $|\omega| \to \infty$, then $g(t) = Ce^{-t^2}$, where C is a
constant. This can be interpreted as saying that of all functions the Gaussian is the most
rapidly decaying in both t and $\omega$, and in this sense is the "best" function to be expressed
numerically as a sampling representation.

Let us then write for the Gaussian $g(t) = e^{-t^2}$,

$$e^{-t^2} = \sum_{n=-\infty}^{\infty} e^{-t_n^2}\,\mathrm{sinc}\left[\frac{\pi}{h}(t - t_n)\right] + e(t) \qquad (13.11.3)$$


The error e(t) depends on the parameters h and a as well as on t, but it is sufficient for 
the present purposes to state the bound, 

$$|e(t)| < e^{-(\pi/2h)^2} \qquad (13.11.4)$$

which can be understood simply as the order of magnitude of the Fourier transform of the
Gaussian at the point where it "spills over" into the region $|\omega| > \pi/h$.

When the summation in (13.11.3) is approximated by one with finite limits, say from
$N_0 - N$ to $N_0 + N$, where $N_0$ is the integer nearest to $-\alpha/h$, there is a further truncation
error. However, if N is chosen so that $N > \pi/(2h^2)$, the truncation error in the summation
is less than the bound given by (13.11.4), and, since this bound is an overestimate, we
shall continue to use it for (13.11.3) as well. The truncated summation gives a remarkably
accurate representation for the Gaussian even for moderate values of N. For example,
$|e(t)| < 5 \times 10^{-5}$ for h = 1/2 and N = 7; $|e(t)| < 2 \times 10^{-10}$ for h = 1/3 and N = 15;
and $|e(t)| < 7 \times 10^{-18}$ for h = 1/4 and N = 25.

One may ask, what is the point of such a numerical representation for the Gaussian, 
which can be computed so easily and quickly as an exponential? The answer is that many 
transcendental functions can be expressed as an integral involving the Gaussian, and by 
substituting (13.11.3) one can often find excellent approximations to the integrals as a sum 
over elementary functions. 

Let us consider as an example the function w(z) of the complex variable z = x + iy,
related to the complex error function by

$$w(z) = e^{-z^2}\,\mathrm{erfc}(-iz) \qquad (13.11.5)$$

having the integral representation

$$w(z) = \frac{1}{\pi i}\int_C \frac{e^{-t^2}\,dt}{t - z} \qquad (13.11.6)$$

where the contour C extends from $-\infty$ to $\infty$, passing below z (see, e.g., [3]). Many methods
exist for the evaluation of this function (e.g., [4]). Substituting the sampling representation
(13.11.3) into (13.11.6) and performing the resulting elementary contour integrals, we obtain


$$w(z) \approx \frac{1}{\pi i}\sum_n h\,e^{-t_n^2}\,\frac{1 - (-1)^n e^{-\pi i(\alpha - z)/h}}{t_n - z} \qquad (13.11.7)$$

where we now omit the error term. One should note that there is no singularity as $z \to t_m$
for some n = m, but a special treatment of the mth term will be required in this case (for
example, by power series expansion).

An alternative form of equation (13.11.7) can be found by expressing the complex exponential
in (13.11.7) in terms of trigonometric functions and using the sampling representation
(13.11.3) with z replacing t. This yields

$$w(z) \approx e^{-z^2} + \frac{1}{\pi i}\sum_n h\,e^{-t_n^2}\,\frac{1 - (-1)^n \cos[\pi(\alpha - z)/h]}{t_n - z} \qquad (13.11.8)$$

This form is particularly useful in obtaining Re w(z) when $|y| \ll 1$. Note that in evaluating
(13.11.7) the exponential inside the summation is a constant and needs to be evaluated only
once; a similar comment holds for the cosine in (13.11.8).

There are a variety of formulas that can now be derived from either equation (13.11.7)
or (13.11.8) by choosing particular values of $\alpha$. Eight interesting choices are: $\alpha$ = 0, x, iy,
or z, plus the values obtained by adding h/2 to each of these. Since the error bound (13.11.3)
assumed a real value of $\alpha$, the choices involving a complex $\alpha$ are useful only if the imaginary
part of z is not too large. This is not the place to catalog all sixteen possible formulas, and we
give only two particular cases that show some of the important features.

First of all let $\alpha = 0$ in equation (13.11.8), which yields

$$w(z) \approx e^{-z^2} + \frac{1}{\pi i}\sum_n h\,e^{-(nh)^2}\,\frac{1 - (-1)^n \cos(\pi z/h)}{nh - z} \qquad (13.11.9)$$

This approximation is good over the entire z-plane. As stated previously, one has to treat the
case where one denominator becomes small by expansion in a power series. Formulas for
the case $\alpha = 0$ were discussed briefly in [5]. They are similar, but not identical, to formulas
derived by Chiarella and Reichel [6], using the method of Goodwin [7].

Next, let $\alpha = z$ in (13.11.7), which yields

$$w(z) \approx e^{-z^2} - \frac{2}{\pi i}\sum_{n\ \mathrm{odd}} \frac{e^{-(z - nh)^2}}{n} \qquad (13.11.10)$$


the sum being over all odd integers (positive and negative). Note that we have made the 
substitution n —> — n in the summation. This formula is simpler than (13.11.9) and contains 
half the number of terms, but its error is worse if y is large. Equation (13.11.10) is the source 
of the approximation formula (6.10.3) for Dawson’s integral, used in §6.10. 
Chapter 14. Statistical Description
of Data 

14.0 Introduction 


In this chapter and the next, the concept of data enters the discussion more 
prominently than before. 

Data consist of numbers, of course. But these numbers are fed into the computer, 
not produced by it. These are numbers to be treated with considerable respect, neither 
to be tampered with, nor subjected to a numerical process whose character you do 
not completely understand. You are well advised to acquire a reverence for data that 
is rather different from the “sporty” attitude that is sometimes allowable, or even 
commendable, in other numerical tasks. 

The analysis of data inevitably involves some trafficking with the field of 
statistics, that gray area which is not quite a branch of mathematics — and just as 
surely not quite a branch of science. In the following sections, you will repeatedly 
encounter the following paradigm: 

• apply some formula to the data to compute “a statistic” 

• compute where the value of that statistic falls in a probability distribution 
that is computed on the basis of some “null hypothesis” 

• if it falls in a very unlikely spot, way out on a tail of the distribution, 
conclude that the null hypothesis is false for your data set 

If a statistic falls in a reasonable part of the distribution, you must not make 
the mistake of concluding that the null hypothesis is “verified” or “proved.” That is 
the curse of statistics, that it can never prove things, only disprove them! At best, 
you can substantiate a hypothesis by ruling out, statistically, a whole long list of 
competing hypotheses, every one that has ever been proposed. After a while your 
adversaries and competitors will give up trying to think of alternative hypotheses, 
or else they will grow old and die, and then your hypothesis will become accepted. 
Sounds crazy, we know, but that’s how science works! 

In this book we make a somewhat arbitrary distinction between data analysis 
procedures that are model-independent and those that are model-dependent. In the 
former category, we include so-called descriptive statistics that characterize a data 
set in general terms: its mean, variance, and so on. We also include statistical tests 
that seek to establish the “sameness” or “differentness” of two or more data sets, or 
that seek to establish and measure a degree of correlation between two data sets. 
These subjects are discussed in this chapter. 



In the other category, model-dependent statistics, we lump the whole subject of
fitting data to a theory, parameter estimation, least-squares fits, and so on. Those 
subjects are introduced in Chapter 15. 

Section 14.1 deals with so-called measures of central tendency, the moments of 
a distribution, the median and mode. In § 14.2 we learn to test whether different data 
sets are drawn from distributions with different values of these measures of central 
tendency. This leads naturally, in §14.3, to the more general question of whether two 
distributions can be shown to be (significantly) different. 

In §14.4—§14.7, we deal with measures of association for two distributions. 
We want to determine whether two variables are “correlated” or “dependent” on 
one another. If they are, we want to characterize the degree of correlation in 
some simple ways. The distinction between parametric and nonparametric (rank) 
methods is emphasized. 

Section 14.8 introduces the concept of data smoothing, and discusses the 
particular case of Savitzky-Golay smoothing filters. 

This chapter draws mathematically on the material on special functions that 
was presented in Chapter 6, especially §6.1—§6.4. You may wish, at this point, to 
review those sections. 

CITED REFERENCES AND FURTHER READING: 

Bevington, P.R. 1969, Data Reduction and Error Analysis for the Physical Sciences (New York:
McGraw-Hill). 

Stuart, A., and Ord, J.K. 1987, Kendall’s Advanced Theory of Statistics, 5th ed. (London: Griffin 
and Co.) [previous eds. published as Kendall, M., and Stuart, A., The Advanced Theory 
of Statistics]. 

Norusis, M.J. 1982, SPSS Introductory Guide: Basic Statistics and Operations; and 1985,
SPSS-X Advanced Statistics Guide (New York: McGraw-Hill).

Dunn, O.J., and Clark, V.A. 1974, Applied Statistics: Analysis of Variance and Regression (New 
York: Wiley). 


14.1 Moments of a Distribution: Mean, 
Variance, Skewness, and So Forth 

When a set of values has a sufficiently strong central tendency, that is, a tendency 
to cluster around some particular value, then it may be useful to characterize the 
set by a few numbers that are related to its moments, the sums of integer powers 
of the values. 

Best known is the mean of the values $x_1, \ldots, x_N$,

$$\bar{x} = \frac{1}{N}\sum_{j=1}^{N} x_j \qquad (14.1.1)$$

which estimates the value around which central clustering occurs. Note the use of
an overbar to denote the mean; angle brackets are an equally common notation, e.g.,
$\langle x \rangle$. You should be aware that the mean is not the only available estimator of this
quantity, nor is it necessarily the best one. For values drawn from a probability
distribution with very broad “tails,” the mean may converge poorly, or not at all, as 
the number of sampled points is increased. Alternative estimators, the median and 
the mode, are mentioned at the end of this section. 

Having characterized a distribution’s central value, one conventionally next 
characterizes its “width” or “variability” around that value. Here again, more than 
one measure is available. Most common is the variance, 

$$\mathrm{Var}(x_1 \ldots x_N) = \frac{1}{N-1}\sum_{j=1}^{N}(x_j - \bar{x})^2 \qquad (14.1.2)$$

or its square root, the standard deviation,

$$\sigma(x_1 \ldots x_N) = \sqrt{\mathrm{Var}(x_1 \ldots x_N)} \qquad (14.1.3)$$

Equation (14.1.2) estimates the mean squared deviation of x from its mean value. 
There is a long story about why the denominator of (14.1.2) is N - 1 instead of
N. If you have never heard that story, you may consult any good statistics text. 
Here we will be content to note that the N — 1 should be changed to N if you 
are ever in the situation of measuring the variance of a distribution whose mean 
x is known a priori rather than being estimated from the data. (We might also 
comment that if the difference between N and N — 1 ever matters to you, then you 
are probably up to no good anyway — e.g., trying to substantiate a questionable 
hypothesis with marginal data.) 
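
Equations (14.1.1) and (14.1.2), together with the remark about a mean known a priori, translate into a few lines of code. This is a plain two-pass sketch with zero-based arrays; the function names are hypothetical and no attempt is made at the roundoff corrections a library routine might include.

```c
#include <assert.h>
#include <stddef.h>

/* Sample mean, equation (14.1.1). */
double sample_mean(const double x[], size_t n)
{
    double s = 0.0;
    size_t j;
    assert(n >= 1);
    for (j = 0; j < n; j++) s += x[j];
    return s / (double)n;
}

/* Variance about a mean estimated from the same data: divides by N-1,
   equation (14.1.2). Two passes: mean first, then squared deviations. */
double sample_variance(const double x[], size_t n)
{
    double xbar, s = 0.0;
    size_t j;
    assert(n >= 2);
    xbar = sample_mean(x, n);
    for (j = 0; j < n; j++) s += (x[j] - xbar) * (x[j] - xbar);
    return s / (double)(n - 1);
}

/* Variance about a mean mu that is known a priori: divides by N. */
double variance_known_mean(const double x[], size_t n, double mu)
{
    double s = 0.0;
    size_t j;
    assert(n >= 1);
    for (j = 0; j < n; j++) s += (x[j] - mu) * (x[j] - mu);
    return s / (double)n;
}
```

For the five values 1, 2, 3, 4, 10 the mean is 4 and the sum of squared deviations is 50, so the N - 1 form gives 12.5 while the known-mean form (with mu = 4) gives 10.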

As the mean depends on the first moment of the data, so do the variance and 
standard deviation depend on the second moment. It is not uncommon, in real 
life, to be dealing with a distribution whose second moment does not exist (i.e., is 
infinite). In this case, the variance or standard deviation is useless as a measure 
of the data’s width around its central value: The values obtained from equations 
(14.1.2) or (14.1.3) will not converge with increased numbers of points, nor show 
any consistency from data set to data set drawn from the same distribution. This can 
occur even when the width of the peak looks, by eye, perfectly finite. A more robust 
estimator of the width is the average deviation or mean absolute deviation, defined by 

$$ \mathrm{ADev}(x_1 \ldots x_N) = \frac{1}{N} \sum_{j=1}^{N} |x_j - \bar{x}| \tag{14.1.4} $$

One often substitutes the sample median $x_{\rm med}$ for $\bar{x}$ in equation (14.1.4). For any
fixed sample, the median in fact minimizes the mean absolute deviation.
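
As a small illustration of equation (14.1.4), the following sketch computes the mean absolute deviation about an arbitrary center, so that the same function serves for the mean and for the median. The name adev_about, the zero-based indexing, and the use of double are assumptions of this sketch, not conventions of the routines in this book.

```c
#include <math.h>
#include <stddef.h>

/* Mean absolute deviation of x[0..n-1] about the center c.
   Hypothetical helper; assumes n >= 1 and zero-based storage. */
double adev_about(const double x[], size_t n, double c)
{
    double sum = 0.0;
    size_t j;

    for (j = 0; j < n; j++) sum += fabs(x[j] - c);
    return sum / (double)n;
}
```

For the sample 1, 2, 3, 4, 100 the mean is 22 and the median is 3; adev_about gives 31.2 about the mean and 20.2 about the median, in line with the statement that the median minimizes this quantity.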

Statisticians have historically sniffed at the use of (14.1.4) instead of (14.1.2), 
since the absolute value brackets in (14.1.4) are “nonanalytic” and make theorem-proving
difficult. In recent years, however, the fashion has changed, and the subject
of robust estimation (meaning, estimation for broad distributions with significant 
numbers of “outlier” points) has become a popular and important one. Higher 
moments, or statistics involving higher powers of the input data, are almost always 
less robust than lower moments or statistics that involve only linear sums or (the 
lowest moment of all) counting.
That being the case, the skewness or third moment, and the kurtosis or fourth
moment, should be used with caution or, better yet, not at all.

Figure 14.1.1. Distributions whose third and fourth moments are significantly different from a normal
(Gaussian) distribution. (a) Skewness or third moment. (b) Kurtosis or fourth moment.

The skewness characterizes the degree of asymmetry of a distribution around its 
mean. While the mean, standard deviation, and average deviation are dimensional 
quantities, that is, have the same units as the measured quantities $x_j$, the skewness
is conventionally defined in such a way as to make it nondimensional. It is a pure
number that characterizes only the shape of the distribution. The usual definition is

$$ \mathrm{Skew}(x_1 \ldots x_N) = \frac{1}{N} \sum_{j=1}^{N} \left[ \frac{x_j - \bar{x}}{\sigma} \right]^3 \tag{14.1.5} $$

where $\sigma = \sigma(x_1 \ldots x_N)$ is the distribution’s standard deviation (14.1.3). A positive
value of skewness signifies a distribution with an asymmetric tail extending out
towards more positive $x$; a negative value signifies a distribution whose tail extends
out towards more negative $x$ (see Figure 14.1.1).

Of course, any set of N measured values is likely to give a nonzero value for
(14.1.5), even if the underlying distribution is in fact symmetrical (has zero skewness).
For (14.1.5) to be meaningful, we need to have some idea of its standard deviation
as an estimator of the skewness of the underlying distribution. Unfortunately, that
depends on the shape of the underlying distribution, and rather critically on its tails!
For the idealized case of a normal (Gaussian) distribution, the standard deviation of
(14.1.5) is approximately $\sqrt{15/N}$ when $\bar{x}$ is the true mean, and $\sqrt{6/N}$ when it is
estimated by the sample mean, (14.1.1). In real life it is good practice to believe in
skewnesses only when they are several or many times as large as this.

The kurtosis is also a nondimensional quantity. It measures the relative 
peakedness or flatness of a distribution. Relative to what? A normal distribution, 
what else! A distribution with positive kurtosis is termed leptokurtic; the outline
of the Matterhorn is an example. A distribution with negative kurtosis is termed 
platykurtic; the outline of a loaf of bread is an example. (See Figure 14.1.1.) And, 
as you no doubt expect, an in-between distribution is termed mesokurtic. 

The conventional definition of the kurtosis is

$$ \mathrm{Kurt}(x_1 \ldots x_N) = \left\{ \frac{1}{N} \sum_{j=1}^{N} \left[ \frac{x_j - \bar{x}}{\sigma} \right]^4 \right\} - 3 \tag{14.1.6} $$

where the $-3$ term makes the value zero for a normal distribution.

The standard deviation of (14.1.6) as an estimator of the kurtosis of an underlying
normal distribution is $\sqrt{96/N}$ when $\sigma$ is the true standard deviation, and $\sqrt{24/N}$
when it is the sample estimate (14.1.3). However, the kurtosis depends on such
a high moment that there are many real-life distributions for which the standard
deviation of (14.1.6) as an estimator is effectively infinite.

Calculation of the quantities defined in this section is perfectly straightforward. 
Many textbooks use the binomial theorem to expand out the definitions into sums 
of various powers of the data, e.g., the familiar 


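$$ \mathrm{Var}(x_1 \ldots x_N) = \frac{1}{N-1} \left[ \left( \sum_{j=1}^{N} x_j^2 \right) - N \bar{x}^2 \right] \tag{14.1.7} $$

but this can magnify the roundoff error by a large factor and is generally unjustifiable
in terms of computing speed. A clever way to minimize roundoff error, especially
for large samples, is to use the corrected two-pass algorithm [1]: First calculate $\bar{x}$,
then calculate $\mathrm{Var}(x_1 \ldots x_N)$ by

$$ \mathrm{Var}(x_1 \ldots x_N) = \frac{1}{N-1} \left\{ \sum_{j=1}^{N} (x_j - \bar{x})^2 - \frac{1}{N} \left[ \sum_{j=1}^{N} (x_j - \bar{x}) \right]^2 \right\} \tag{14.1.8} $$

The second sum would be zero if $\bar{x}$ were exact, but otherwise it does a good job of
correcting the roundoff error in the first term.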
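
To see the difference in practice, the sketch below evaluates the one-pass form (14.1.7) and the corrected two-pass form (14.1.8) in single precision. The function names and the zero-based arrays are illustrative assumptions, and each function assumes n >= 2.

```c
#include <stddef.h>

/* One-pass textbook form, equation (14.1.7), in single precision.
   Note that sx*sx/n equals N times the squared mean. */
float var_onepass(const float x[], size_t n)
{
    float sx = 0.0f, sxx = 0.0f;
    size_t j;

    for (j = 0; j < n; j++) {
        sx += x[j];
        sxx += x[j] * x[j];
    }
    return (sxx - sx * sx / (float)n) / (float)(n - 1);
}

/* Corrected two-pass form, equation (14.1.8), in single precision. */
float var_twopass(const float x[], size_t n)
{
    float mean = 0.0f, ep = 0.0f, ss = 0.0f;
    size_t j;

    for (j = 0; j < n; j++) mean += x[j];   /* first pass: the mean */
    mean /= (float)n;
    for (j = 0; j < n; j++) {               /* second pass: deviations */
        float d = x[j] - mean;
        ep += d;       /* would be exactly zero if mean were exact */
        ss += d * d;
    }
    return (ss - ep * ep / (float)n) / (float)(n - 1);
}
```

With data such as 10000, 10001, 10002 (true variance 1), the squares are near $10^8$, where adjacent single-precision values are 8 apart, so the one-pass difference of large sums retains few correct digits. The two-pass form works with deviations of order 1 and is not affected in the same way.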


#include <math.h>

void moment(float data[], int n, float *ave, float *adev, float *sdev,
    float *var, float *skew, float *curt)
/* Given an array of data[1..n], this routine returns its mean ave, average deviation adev,
standard deviation sdev, variance var, skewness skew, and kurtosis curt. */
{
    void nrerror(char error_text[]);
    int j;
    float ep=0.0,s,p;

    if (n <= 1) nrerror("n must be at least 2 in moment");
    s=0.0;                                  /* First pass to get the mean. */
    for (j=1;j<=n;j++) s += data[j];
    *ave=s/n;
    *adev=(*var)=(*skew)=(*curt)=0.0;       /* Second pass to get the first (absolute), second, */
    for (j=1;j<=n;j++) {                    /* third, and fourth moments of the deviation from */
        *adev += fabs(s=data[j]-(*ave));    /* the mean. */
        ep += s;
        *var += (p=s*s);
        *skew += (p *= s);
        *curt += (p *= s);
    }
    *adev /= n;
    *var=(*var-ep*ep/n)/(n-1);              /* Corrected two-pass formula. */
    *sdev=sqrt(*var);                       /* Put the pieces together according to the */
    if (*var) {                             /* conventional definitions. */
        *skew /= (n*(*var)*(*sdev));
        *curt=(*curt)/(n*(*var)*(*var))-3.0;
    } else nrerror("No skew/kurtosis when variance = 0 (in moment)");
}


Semi-Invariants

The mean and variance of independent random variables are additive: If x and y are 
drawn independently from two, possibly different, probability distributions, then 

$$ \overline{(x + y)} = \bar{x} + \bar{y} \qquad \mathrm{Var}(x + y) = \mathrm{Var}(x) + \mathrm{Var}(y) \tag{14.1.9} $$

Higher moments are not, in general, additive. However, certain combinations of them,
called semi-invariants, are in fact additive. If the centered moments of a distribution are
denoted $M_k$,

$$ M_k \equiv \left\langle (x_i - \bar{x})^k \right\rangle \tag{14.1.10} $$

so that, e.g., $M_2 = \mathrm{Var}(x)$, then the first few semi-invariants, denoted $I_k$, are given by

$$ \begin{aligned} I_2 &= M_2 \qquad I_3 = M_3 \qquad I_4 = M_4 - 3M_2^2 \\ I_5 &= M_5 - 10 M_2 M_3 \qquad I_6 = M_6 - 15 M_2 M_4 - 10 M_3^2 + 30 M_2^3 \end{aligned} \tag{14.1.11} $$

Notice that the skewness and kurtosis, equations (14.1.5) and (14.1.6), are simple powers
of the semi-invariants,

$$ \mathrm{Skew}(x) = I_3 / I_2^{3/2} \qquad \mathrm{Kurt}(x) = I_4 / I_2^2 \tag{14.1.12} $$

A Gaussian distribution has all its semi-invariants higher than $I_2$ equal to zero. A Poisson
distribution has all of its semi-invariants equal to its mean. For more details, see [2].


Median and Mode 


The median of a probability distribution function $p(x)$ is the value $x_{\rm med}$ for
which larger and smaller values of $x$ are equally probable:

$$ \int_{-\infty}^{x_{\rm med}} p(x)\,dx = \frac{1}{2} = \int_{x_{\rm med}}^{\infty} p(x)\,dx \tag{14.1.13} $$

The median of a distribution is estimated from a sample of values $x_1, \ldots, x_N$
by finding that value $x_i$ which has equal numbers of values above it and below
it. Of course, this is not possible when N is even. In that case it is conventional
to estimate the median as the mean of the unique two central values. If the values
$x_j$, $j = 1, \ldots, N$, are sorted into ascending (or, for that matter, descending) order,
then the formula for the median is

$$ x_{\rm med} = \begin{cases} x_{(N+1)/2}, & N \text{ odd} \\ \frac{1}{2}\left( x_{N/2} + x_{(N/2)+1} \right), & N \text{ even} \end{cases} \tag{14.1.14} $$


If a distribution has a strong central tendency, so that most of its area is under 
a single peak, then the median is an estimator of the central value. It is a more 
robust estimator than the mean is: The median fails as an estimator only if the area 
in the tails is large, while the mean fails if the first moment of the tails is large; 
it is easy to construct examples where the first moment of the tails is large even 
though their area is negligible. 

To find the median of a set of values, one can proceed by sorting the set and
then applying (14.1.14). This is a process of order $N \log N$. You might rightly think
that this is wasteful, since it yields much more information than just the median
(e.g., the upper and lower quartile points, the deciles, etc.). In fact, we saw in
§8.5 that the element $x_{(N+1)/2}$ can be located in of order $N$ operations. Consult
that section for routines.
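
For reference, the sort-then-apply-(14.1.14) approach can be sketched with the standard library's qsort. This is the order $N \log N$ method, not the selection routine of §8.5. The function name median_by_sort, the zero-based arrays, and the return-code convention are assumptions of this sketch; the 1-based index $k$ of (14.1.14) maps to w[k-1].

```c
#include <stdlib.h>
#include <string.h>

/* Comparison callback for qsort: ascending order of doubles. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Median of x[0..n-1] via a sorted copy, following equation (14.1.14).
   Returns 0 on success, -1 on empty input or allocation failure. */
int median_by_sort(const double x[], size_t n, double *med)
{
    double *w;

    if (n == 0) return -1;
    w = malloc(n * sizeof *w);
    if (w == NULL) return -1;
    memcpy(w, x, n * sizeof *w);          /* leave the caller's data unsorted */
    qsort(w, n, sizeof *w, cmp_double);
    if (n % 2)
        *med = w[(n + 1) / 2 - 1];        /* N odd: element (N+1)/2 */
    else
        *med = 0.5 * (w[n / 2 - 1] + w[n / 2]);   /* N even: mean of N/2 and N/2+1 */
    free(w);
    return 0;
}
```

Working on a copy costs extra memory but keeps the input order intact, which matters if the same array is later used for paired comparisons.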

The mode of a probability distribution function p(x) is the value of x where it 
takes on a maximum value. The mode is useful primarily when there is a single, sharp 
maximum, in which case it estimates the central value. Occasionally, a distribution 
will be bimodal, with two relative maxima; then one may wish to know the two 
modes individually. Note that, in such cases, the mean and median are not very 
useful, since they will give only a “compromise” value between the two peaks. 


CITED REFERENCES AND FURTHER READING: 

Bevington, P.R. 1969, Data Reduction and Error Analysis for the Physical Sciences (New York:
McGraw-Hill), Chapter 2. 

Stuart, A., and Ord, J.K. 1987, Kendall’s Advanced Theory of Statistics, 5th ed. (London: Griffin 
and Co.) [previous eds. published as Kendall, M., and Stuart, A., The Advanced Theory 
of Statistics], vol. 1, §10.15 

Norusis, M.J. 1982, SPSS Introductory Guide: Basic Statistics and Operations; and 1985,
SPSS-X Advanced Statistics Guide (New York: McGraw-Hill).

Chan, T.F., Golub, G.H., and LeVeque, R.J. 1983, American Statistician, vol. 37, pp. 242-247. [1] 

Cramer, H. 1946, Mathematical Methods of Statistics (Princeton: Princeton University Press), 
§15.10. [2] 


14.2 Do Two Distributions Have the Same 
Means or Variances? 

Not uncommonly we want to know whether two distributions have the same 
mean. For example, a first set of measured values may have been gathered before 
some event, a second set after it. We want to know whether the event, a “treatment” 
or a “change in a control parameter,” made a difference. 

Our first thought is to ask “how many standard deviations” one sample mean is 
from the other. That number may in fact be a useful thing to know. It does relate to 
the strength or “importance” of a difference of means if that difference is genuine. 
However, by itself, it says nothing about whether the difference is genuine, that is, 
statistically significant. A difference of means can be very small compared to the 
standard deviation, and yet very significant, if the number of data points is large. 
Conversely, a difference may be moderately large but not significant, if the data 
are sparse. We will be meeting these distinct concepts of strength and significance 
several times in the next few sections. 

A quantity that measures the significance of a difference of means is not the 
number of standard deviations that they are apart, but the number of so-called 
standard errors that they are apart. The standard error of a set of values measures 
the accuracy with which the sample mean estimates the population (or “true”) mean. 
Typically the standard error is equal to the sample’s standard deviation divided by 
the square root of the number of points in the sample. 







Student’s t-test for Significantly Different Means 


Applying the concept of standard error, the conventional statistic for measuring 
the significance of a difference of means is termed Student’s t. When the two 
distributions are thought to have the same variance, but possibly different means, 
then Student’s t is computed as follows: First, estimate the standard error of the 
difference of the means, $s_D$, from the “pooled variance” by the formula

$$ s_D = \sqrt{ \frac{\sum_{i \in A} (x_i - \bar{x}_A)^2 + \sum_{i \in B} (x_i - \bar{x}_B)^2}{N_A + N_B - 2} \left( \frac{1}{N_A} + \frac{1}{N_B} \right) } \tag{14.2.1} $$

where each sum is over the points in one sample, the first or second, each mean
likewise refers to one sample or the other, and $N_A$ and $N_B$ are the numbers of points
in the first and second samples, respectively. Second, compute $t$ by

$$ t = \frac{\bar{x}_A - \bar{x}_B}{s_D} \tag{14.2.2} $$


Third, evaluate the significance of this value of $t$ for Student’s distribution with
$N_A + N_B - 2$ degrees of freedom, by equations (6.4.7) and (6.4.9), and by the
routine betai (incomplete beta function) of §6.4.

The significance is a number between zero and one, and is the probability that
$|t|$ could be this large or larger just by chance, for distributions with equal means.
Therefore, a small numerical value of the significance (0.05 or 0.01) means that the
observed difference is “very significant.” The function $A(t|\nu)$ in equation (6.4.7)
is one minus the significance.

As a routine, we have 



#include <math.h>

void ttest(float data1[], unsigned long n1, float data2[], unsigned long n2,
    float *t, float *prob)
/* Given the arrays data1[1..n1] and data2[1..n2], this routine returns Student's t as t,
and its significance as prob, small values of prob indicating that the arrays have significantly
different means. The data arrays are assumed to be drawn from populations with the same
true variance. */
{
    void avevar(float data[], unsigned long n, float *ave, float *var);
    float betai(float a, float b, float x);
    float var1,var2,svar,df,ave1,ave2;

    avevar(data1,n1,&ave1,&var1);
    avevar(data2,n2,&ave2,&var2);
    df=n1+n2-2;                                     /* Degrees of freedom. */
    svar=((n1-1)*var1+(n2-1)*var2)/df;              /* Pooled variance. */
    *t=(ave1-ave2)/sqrt(svar*(1.0/n1+1.0/n2));
    *prob=betai(0.5*df,0.5,df/(df+(*t)*(*t)));      /* See equation (6.4.9). */
}


which makes use of the following routine for computing the mean and variance
of a set of numbers.

void avevar(float data[], unsigned long n, float *ave, float *var)
/* Given array data[1..n], returns its mean as ave and its variance as var. */
{
    unsigned long j;
    float s,ep;

    for (*ave=0.0,j=1;j<=n;j++) *ave += data[j];
    *ave /= n;
    *var=ep=0.0;
    for (j=1;j<=n;j++) {
        s=data[j]-(*ave);
        ep += s;
        *var += s*s;
    }
    *var=(*var-ep*ep/n)/(n-1);      /* Corrected two-pass formula (14.1.8). */
}


The next case to consider is where the two distributions have significantly 
different variances, but we nevertheless want to know if their means are the same or 
different. (A treatment for baldness has caused some patients to lose all their hair 
and turned others into werewolves, but we want to know if it helps cure baldness on 
the average!) Be suspicious of the unequal-variance t-test: If two distributions have
very different variances, then they may also be substantially different in shape; in 
that case, the difference of the means may not be a particularly useful thing to know. 

To find out whether the two data sets have variances that are significantly 
different, you use the F-test, described later on in this section. 

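The relevant statistic for the unequal-variance t-test is

$$ t = \frac{\bar{x}_A - \bar{x}_B}{[\mathrm{Var}(x_A)/N_A + \mathrm{Var}(x_B)/N_B]^{1/2}} \tag{14.2.3} $$

This statistic is distributed approximately as Student’s t with a number of degrees
of freedom equal to

$$ \frac{\left[ \dfrac{\mathrm{Var}(x_A)}{N_A} + \dfrac{\mathrm{Var}(x_B)}{N_B} \right]^2}{\dfrac{[\mathrm{Var}(x_A)/N_A]^2}{N_A - 1} + \dfrac{[\mathrm{Var}(x_B)/N_B]^2}{N_B - 1}} \tag{14.2.4} $$
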

Expression (14.2.4) is in general not an integer, but equation (6.4.7) doesn’t care. 
The routine is 

#include <math.h>
#include "nrutil.h"

void tutest(float data1[], unsigned long n1, float data2[], unsigned long n2,
    float *t, float *prob)
/* Given the arrays data1[1..n1] and data2[1..n2], this routine returns Student's t as t, and
its significance as prob, small values of prob indicating that the arrays have significantly
different means. The data arrays are allowed to be drawn from populations with unequal variances. */
{
    void avevar(float data[], unsigned long n, float *ave, float *var);
    float betai(float a, float b, float x);
    float var1,var2,df,ave1,ave2;

    avevar(data1,n1,&ave1,&var1);
    avevar(data2,n2,&ave2,&var2);
    *t=(ave1-ave2)/sqrt(var1/n1+var2/n2);
    df=SQR(var1/n1+var2/n2)/(SQR(var1/n1)/(n1-1)+SQR(var2/n2)/(n2-1));
    *prob=betai(0.5*df,0.5,df/(df+SQR(*t)));
}

Our final example of a Student’s t test is the case of paired samples. Here 
we imagine that much of the variance in both samples is due to effects that are 
point-by-point identical in the two samples. For example, we might have two job 
candidates who have each been rated by the same ten members of a hiring committee. 
We want to know if the means of the ten scores differ significantly. We first try 
ttest above, and obtain a value of prob that is not especially significant (e.g., 
> 0.05). But perhaps the significance is being washed out by the tendency of some 
committee members always to give high scores, others always to give low scores, 
which increases the apparent variance and thus decreases the significance of any 
difference in the means. We thus try the paired-sample formulas, 

$$ \mathrm{Cov}(x_A, x_B) \equiv \frac{1}{N-1} \sum_{i=1}^{N} (x_{Ai} - \bar{x}_A)(x_{Bi} - \bar{x}_B) \tag{14.2.5} $$

$$ s_D = \left[ \frac{\mathrm{Var}(x_A) + \mathrm{Var}(x_B) - 2\,\mathrm{Cov}(x_A, x_B)}{N} \right]^{1/2} \tag{14.2.6} $$

$$ t = \frac{\bar{x}_A - \bar{x}_B}{s_D} \tag{14.2.7} $$

where N is the number in each sample (number of pairs). Notice that it is important
that a particular value of $i$ label the corresponding points in each sample, that is,
the ones that are paired. The significance of the t statistic in (14.2.7) is evaluated
for $N - 1$ degrees of freedom.

The routine is



#include <math.h>

void tptest(float data1[], float data2[], unsigned long n, float *t,
    float *prob)
/* Given the paired arrays data1[1..n] and data2[1..n], this routine returns Student's t for
paired data as t, and its significance as prob, small values of prob indicating a significant
difference of means. */
{
    void avevar(float data[], unsigned long n, float *ave, float *var);
    float betai(float a, float b, float x);
    unsigned long j;
    float var1,var2,ave1,ave2,sd,df,cov=0.0;

    avevar(data1,n,&ave1,&var1);
    avevar(data2,n,&ave2,&var2);
    for (j=1;j<=n;j++)
        cov += (data1[j]-ave1)*(data2[j]-ave2);
    cov /= df=n-1;
    sd=sqrt((var1+var2-2.0*cov)/n);
    *t=(ave1-ave2)/sd;
    *prob=betai(0.5*df,0.5,df/(df+(*t)*(*t)));
}

F-Test for Significantly Different Variances


The F-test tests the hypothesis that two samples have different variances by 
trying to reject the null hypothesis that their variances are actually consistent. The 
statistic F is the ratio of one variance to the other, so values either $\gg 1$ or $\ll 1$
will indicate very significant differences. The distribution of F in the null case is 
given in equation (6.4.11), which is evaluated using the routine betai. In the most 
common case, we are willing to disprove the null hypothesis (of equal variances) by 
either very large or very small values of F, so the correct significance is two-tailed, 
the sum of two incomplete beta functions. It turns out, by equation (6.4.3), that the 
two tails are always equal; we need compute only one, and double it. Occasionally, 
when the null hypothesis is strongly viable, the identity of the two tails can become 
confused, giving an indicated probability greater than one. Changing the probability 
to two minus itself correctly exchanges the tails. These considerations and equation 
(6.4.3) give the routine 


void ftest(float data1[], unsigned long n1, float data2[], unsigned long n2,
    float *f, float *prob)
/* Given the arrays data1[1..n1] and data2[1..n2], this routine returns the value of f, and
its significance as prob. Small values of prob indicate that the two arrays have significantly
different variances. */
{
    void avevar(float data[], unsigned long n, float *ave, float *var);
    float betai(float a, float b, float x);
    float var1,var2,ave1,ave2,df1,df2;

    avevar(data1,n1,&ave1,&var1);
    avevar(data2,n2,&ave2,&var2);
    if (var1 > var2) {      /* Make F the ratio of the larger variance to the smaller one. */
        *f=var1/var2;
        df1=n1-1;
        df2=n2-1;
    } else {
        *f=var2/var1;
        df1=n2-1;
        df2=n1-1;
    }
    *prob = 2.0*betai(0.5*df2,0.5*df1,df2/(df2+df1*(*f)));
    if (*prob > 1.0) *prob=2.0-*prob;
}
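
The bookkeeping of which sample supplies the numerator, and hence which degrees of freedom go with it, is easy to get backwards. The following sketch isolates that step, without the call to betai. The names f_ratio and sample_var, the zero-based double arrays, and the return codes are assumptions of this illustration.

```c
#include <stddef.h>

/* Sample variance of x[0..n-1] (N - 1 in the denominator); n >= 2. */
static double sample_var(const double x[], size_t n)
{
    double mean = 0.0, ss = 0.0;
    size_t j;

    for (j = 0; j < n; j++) mean += x[j];
    mean /= (double)n;
    for (j = 0; j < n; j++) ss += (x[j] - mean) * (x[j] - mean);
    return ss / (double)(n - 1);
}

/* F as the ratio of the larger variance to the smaller, with df1
   belonging to the numerator sample and df2 to the denominator sample.
   Returns 0 on success, -1 if a sample has fewer than 2 points or the
   smaller variance is zero. */
int f_ratio(const double a[], size_t na, const double b[], size_t nb,
            double *f, double *df1, double *df2)
{
    double va, vb;

    if (na < 2 || nb < 2) return -1;
    va = sample_var(a, na);
    vb = sample_var(b, nb);
    if (va > vb) {
        if (vb <= 0.0) return -1;
        *f = va / vb;
        *df1 = (double)(na - 1);
        *df2 = (double)(nb - 1);
    } else {
        if (va <= 0.0) return -1;
        *f = vb / va;
        *df1 = (double)(nb - 1);
        *df2 = (double)(na - 1);
    }
    return 0;
}
```

For samples with variances 2.5 (5 points) and 14 (6 points), this gives F = 5.6 with 5 and 4 degrees of freedom for numerator and denominator respectively.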


CITED REFERENCES AND FURTHER READING: 

von Mises, R. 1964, Mathematical Theory of Probability and Statistics (New York: Academic 
Press), Chapter IX(B). 

Norusis, M.J. 1982, SPSS Introductory Guide: Basic Statistics and Operations; and 1985,
SPSS-X Advanced Statistics Guide (New York: McGraw-Hill).


14.3 Are Two Distributions Different?


Given two sets of data, we can generalize the questions asked in the previous 
section and ask the single question: Are the two sets drawn from the same distribution 
function, or from different distribution functions? Equivalently, in proper statistical 
language, “Can we disprove, to a certain required level of significance, the null 
hypothesis that two data sets are drawn from the same population distribution 
function?” Disproving the null hypothesis in effect proves that the data sets are from 
different distributions. Failing to disprove the null hypothesis, on the other hand, 
only shows that the data sets can be consistent with a single distribution function. 
One can never prove that two data sets come from a single distribution, since (e.g.) 
no practical amount of data can distinguish between two distributions which differ 
only by one part in $10^{10}$.

Proving that two distributions are different, or showing that they are consistent, 
is a task that comes up all the time in many areas of research: Are the visible stars 
distributed uniformly in the sky? (That is, is the distribution of stars as a function 
of declination — position in the sky — the same as the distribution of sky area as 
a function of declination?) Are educational patterns the same in Brooklyn as in the 
Bronx? (That is, are the distributions of people as a function of last-grade-attended 
the same?) Do two brands of fluorescent lights have the same distribution of 
burn-out times? Is the incidence of chicken pox the same for first-born, second-born, 
third-born children, etc.? 

These four examples illustrate the four combinations arising from two different 
dichotomies: (1) The data are either continuous or binned. (2) Either we wish to 
compare one data set to a known distribution, or we wish to compare two equally 
unknown data sets. The data sets on fluorescent lights and on stars are continuous, 
since we can be given lists of individual burnout times or of stellar positions. The
data sets on chicken pox and educational level are binned, since we are given
tables of numbers of events in discrete categories: first-born, second-born, etc.; or
6th Grade, 7th Grade, etc. Stars and chicken pox, on the other hand, share the 
property that the null hypothesis is a known distribution (distribution of area in the 
sky, or incidence of chicken pox in the general population). Fluorescent lights and 
educational level involve the comparison of two equally unknown data sets (the two 
brands, or Brooklyn and the Bronx). 

One can always turn continuous data into binned data, by grouping the events 
into specified ranges of the continuous variable(s): declinations between 0 and 10 
degrees, 10 and 20, 20 and 30, etc. Binning involves a loss of information, however. 
Also, there is often considerable arbitrariness as to how the bins should be chosen. 
Along with many other investigators, we prefer to avoid unnecessary binning of data. 

The accepted test for differences between binned distributions is the chi-square 
test. For continuous data as a function of a single variable, the most generally 
accepted test is the Kolmogorov-Smirnov test. We consider each in turn. 



Chi-Square Test 

Suppose that $N_i$ is the number of events observed in the $i$th bin, and that $n_i$ is
the number expected according to some known distribution. Note that the $N_i$’s are
integers, while the $n_i$’s may not be. Then the chi-square statistic is

$$ \chi^2 = \sum_i \frac{(N_i - n_i)^2}{n_i} \tag{14.3.1} $$



where the sum is over all bins. A large value of $\chi^2$ indicates that the null hypothesis
(that the $N_i$’s are drawn from the population represented by the $n_i$’s) is rather unlikely.

Any term $j$ in (14.3.1) with $0 = n_j = N_j$ should be omitted from the sum. A
term with $n_j = 0$, $N_j \neq 0$ gives an infinite $\chi^2$, as it should, since in this case the
$N_i$’s cannot possibly be drawn from the $n_i$’s!

The chi-square probability function $Q(\chi^2|\nu)$ is an incomplete gamma function,
and was already discussed in §6.2 (see equation 6.2.18). Strictly speaking, $Q(\chi^2|\nu)$
is the probability that the sum of the squares of $\nu$ random normal variables of unit
variance (and zero mean) will be greater than $\chi^2$. The terms in the sum (14.3.1)
are not individually normal. However, if either the number of bins is large ($\gg 1$),
or the number of events in each bin is large ($\gg 1$), then the chi-square probability
function is a good approximation to the distribution of (14.3.1) in the case of the null
hypothesis. Its use to estimate the significance of the chi-square test is standard.

The appropriate value of $\nu$, the number of degrees of freedom, bears some
additional discussion. If the data are collected with the model $n_i$’s fixed (that
is, not later renormalized to fit the total observed number of events $\sum N_i$), then $\nu$
equals the number of bins $N_B$. (Note that this is not the total number of events!)
Much more commonly, the $n_i$’s are normalized after the fact so that their sum equals
the sum of the $N_i$’s. In this case the correct value for $\nu$ is $N_B - 1$, and the model
is said to have one constraint (knstrn=1 in the program below). If the model that
gives the $n_i$’s has additional free parameters that were adjusted after the fact to agree
with the data, then each of these additional “fitted” parameters decreases $\nu$ (and
increases knstrn) by one additional unit.

We have, then, the following program: 

void chsone(float bins[], float ebins[], int nbins, int knstrn, float *df,
    float *chsq, float *prob)
/* Given the array bins[1..nbins] containing the observed numbers of events, and an array
ebins[1..nbins] containing the expected numbers of events, and given the number of constraints
knstrn (normally one), this routine returns (trivially) the number of degrees of freedom
df, and (nontrivially) the chi-square chsq and the significance prob. A small value of prob
indicates a significant difference between the distributions bins and ebins. Note that bins
and ebins are both float arrays, although bins will normally contain integer values. */
{
    float gammq(float a, float x);
    void nrerror(char error_text[]);
    int j;
    float temp;

    *df=nbins-knstrn;
    *chsq=0.0;
    for (j=1;j<=nbins;j++) {
        if (ebins[j] <= 0.0) nrerror("Bad expected number in chsone");
        temp=bins[j]-ebins[j];
        *chsq += temp*temp/ebins[j];
    }
    *prob=gammq(0.5*(*df),0.5*(*chsq));     /* Chi-square probability function. See §6.2. */
}
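
The routine above treats any nonpositive expected count as an error. As an alternative, the sketch below follows the two rules stated in the text for equation (14.3.1) literally: a bin with $0 = n_j = N_j$ is omitted, and a bin with $n_j = 0$, $N_j \neq 0$ yields an infinite chi-square. The name chisq_expected, the zero-based double arrays, the return codes, and the choice not to reduce the degrees of freedom for omitted bins are all assumptions of this sketch. It computes the statistic only, not the significance.

```c
#include <math.h>
#include <stddef.h>

/* Chi-square statistic (14.3.1) for observed counts obs[0..nbins-1]
   against expected counts expct[0..nbins-1]. Sets *df = nbins - knstrn.
   Returns 0 on success (with *chsq possibly INFINITY), -1 if any
   expected count is negative. */
int chisq_expected(const double obs[], const double expct[], size_t nbins,
                   int knstrn, double *chsq, double *df)
{
    size_t j;

    *df = (double)nbins - (double)knstrn;
    *chsq = 0.0;
    for (j = 0; j < nbins; j++) {
        double diff;
        if (expct[j] < 0.0) return -1;
        if (expct[j] == 0.0) {
            if (obs[j] == 0.0) continue;    /* 0 = n_j = N_j: omit the term */
            *chsq = INFINITY;               /* n_j = 0, N_j != 0 */
            return 0;
        }
        diff = obs[j] - expct[j];
        *chsq += diff * diff / expct[j];
    }
    return 0;
}
```

With observed counts 18, 22, 20 against expected counts of 20 in each bin, the statistic is 4/20 + 4/20 + 0 = 0.4, and with one constraint there are 2 degrees of freedom.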

Next we consider the case of comparing two binned data sets. Let $R_i$ be the
number of events in bin $i$ for the first data set, $S_i$ the number of events in the same
bin $i$ for the second data set. Then the chi-square statistic is

$$ \chi^2 = \sum_i \frac{(R_i - S_i)^2}{R_i + S_i} \tag{14.3.2} $$


Comparing (14.3.2) to (14.3.1), you should note that the denominator of (14.3.2) is
not just the average of $R_i$ and $S_i$ (which would be an estimator of $n_i$ in 14.3.1).
Rather, it is twice the average, the sum. The reason is that each term in a chi-square
sum is supposed to approximate the square of a normally distributed quantity with
unit variance. The variance of the difference of two normal quantities is the sum
of their individual variances, not the average.

If the data were collected in such a way that the sum of the $R_i$’s is necessarily
equal to the sum of $S_i$’s, then the number of degrees of freedom is equal to one
less than the number of bins, $N_B - 1$ (that is, knstrn = 1), the usual case. If
this requirement were absent, then the number of degrees of freedom would be $N_B$.
Example: A birdwatcher wants to know whether the distribution of sighted birds
as a function of species is the same this year as last. Each bin corresponds to one
species. If the birdwatcher takes his data to be the first 1000 birds that he saw in
each year, then the number of degrees of freedom is $N_B - 1$. If he takes his data to
be all the birds he saw on a random sample of days, the same days in each year, then
the number of degrees of freedom is $N_B$ (knstrn = 0). In this latter case, note that
he is also testing whether the birds were more numerous overall in one year or the
other: That is the extra degree of freedom. Of course, any additional constraints on
the data set lower the number of degrees of freedom (i.e., increase knstrn to more
positive values) in accordance with their number.

The program is 


void chstwo(float bins1[], float bins2[], int nbins, int knstrn, float *df,
    float *chsq, float *prob)
/* Given the arrays bins1[1..nbins] and bins2[1..nbins], containing two sets of binned
data, and given the number of constraints knstrn (normally 1 or 0), this routine returns the
number of degrees of freedom df, the chi-square chsq, and the significance prob. A small value
of prob indicates a significant difference between the distributions bins1 and bins2. Note that
bins1 and bins2 are both float arrays, although they will normally contain integer values. */
{
    float gammq(float a, float x);
    int j;
    float temp;

    *df=nbins-knstrn;
    *chsq=0.0;
    for (j=1;j<=nbins;j++)
        if (bins1[j] == 0.0 && bins2[j] == 0.0)
            --(*df);        /* No data means one less degree of freedom. */
        else {
            temp=bins1[j]-bins2[j];
            *chsq += temp*temp/(bins1[j]+bins2[j]);
        }
    *prob=gammq(0.5*(*df),0.5*(*chsq));     /* Chi-square probability function. See §6.2. */
}

Equation (14.3.2) and the routine chstwo both apply to the case where the total
number of data points is the same in the two binned sets. For unequal numbers of 
data points, the formula analogous to (14.3.2) is 


$$ \chi^2 = \sum_i \frac{\left( \sqrt{S/R}\, R_i - \sqrt{R/S}\, S_i \right)^2}{R_i + S_i} \tag{14.3.3} $$

where

$$ R \equiv \sum_i R_i \qquad S \equiv \sum_i S_i \tag{14.3.4} $$

are the respective numbers of data points. It is straightforward to make the 
corresponding change in chstwo. 

Kolmogorov-Smirnov Test 

The Kolmogorov-Smirnov (or K-S) test is applicable to unbinned distributions
that are functions of a single independent variable, that is, to data sets where each
data point can be associated with a single number (lifetime of each lightbulb when
it burns out, or declination of each star). In such cases, the list of data points can
be easily converted to an unbiased estimator $S_N(x)$ of the cumulative distribution
function of the probability distribution from which it was drawn: If the N events are
located at values $x_i$, $i = 1, \ldots, N$, then $S_N(x)$ is the function giving the fraction
of data points to the left of a given value $x$. This function is obviously constant
between consecutive (i.e., sorted into ascending order) $x_i$’s, and jumps by the same
constant $1/N$ at each $x_i$. (See Figure 14.3.1.)

Different distribution functions, or sets of data, give different cumulative 
distribution function estimates by the above procedure. However, all cumulative 
distribution functions agree at the smallest allowable value of x (where they are 
zero), and at the largest allowable value of x (where they are unity). (The smallest 
and largest values might of course be $\pm\infty$.) So it is the behavior between the largest
and smallest values that distinguishes distributions. 

One can think of any number of statistics to measure the overall difference 
between two cumulative distribution functions: the absolute value of the area between 
them, for example. Or their integrated mean square difference. The Kolmogorov- 
Smirnov D is a particularly simple measure: It is defined as the maximum value 
of the absolute difference between two cumulative distribution functions. Thus, 
for comparing one data set’s $S_N(x)$ to a known cumulative distribution function
P(x), the K-S statistic is

$$D = \max_{-\infty < x < \infty} \left| S_N(x) - P(x) \right| \qquad (14.3.5)$$

while for comparing two different cumulative distribution functions $S_{N_1}(x)$ and
$S_{N_2}(x)$, the K-S statistic is

$$D = \max_{-\infty < x < \infty} \left| S_{N_1}(x) - S_{N_2}(x) \right| \qquad (14.3.6)$$

Figure 14.3.1. Kolmogorov-Smirnov statistic D. A measured distribution of values in x (shown
as N dots on the lower abscissa) is to be compared with a theoretical distribution whose cumulative
probability distribution is plotted as P(x). A step-function cumulative probability distribution $S_N(x)$ is
constructed, one that rises an equal amount at each measured point. D is the greatest distance between
the two cumulative distributions.

What makes the K-S statistic useful is that its distribution in the case of the null 
hypothesis (data sets drawn from the same distribution) can be calculated, at least to 
useful approximation, thus giving the significance of any observed nonzero value of 
D. A central feature of the K-S test is that it is invariant under reparametrization
of x; in other words, you can locally slide or stretch the x axis in Figure 14.3.1,
and the maximum distance D remains unchanged. For example, you will get the
same significance using x as using log x.

The function that enters into the calculation of the significance can be written 
as the following sum: 


$$Q_{KS}(\lambda) = 2 \sum_{j=1}^{\infty} (-1)^{j-1} e^{-2 j^2 \lambda^2} \qquad (14.3.7)$$

which is a monotonic function with the limiting values

$$Q_{KS}(0) = 1 \qquad Q_{KS}(\infty) = 0 \qquad (14.3.8)$$

In terms of this function, the significance level of an observed value of D (as
a disproof of the null hypothesis that the distributions are the same) is given
approximately [1] by the formula

$$\mathrm{Probability}(D > \mathrm{observed}) = Q_{KS}\left( \left[ \sqrt{N_e} + 0.12 + 0.11/\sqrt{N_e} \right] D \right) \qquad (14.3.9)$$

where $N_e$ is the effective number of data points, $N_e = N$ for the case (14.3.5)
of one distribution, and

$$N_e = \frac{N_1 N_2}{N_1 + N_2} \qquad (14.3.10)$$

for the case (14.3.6) of two distributions, where $N_1$ is the number of data points in
the first distribution, $N_2$ the number in the second.

The nature of the approximation involved in (14.3.9) is that it becomes
asymptotically accurate as $N_e$ becomes large, but is already quite good for $N_e$ > 4, as
small a number as one might ever actually use. (See [1].)

So, we have the following routines for the cases of one and two distributions: 


#include <math.h>
#include "nrutil.h"

void ksone(float data[], unsigned long n, float (*func)(float), float *d,
    float *prob)
/* Given an array data[1..n], and given a user-supplied function of a single variable func which
is a cumulative distribution function ranging from 0 (for smallest values of its argument) to 1
(for largest values of its argument), this routine returns the K-S statistic d, and the significance
level prob. Small values of prob show that the cumulative distribution function of data is
significantly different from func. The array data is modified by being sorted into ascending
order. */
{
    float probks(float alam);
    void sort(unsigned long n, float arr[]);
    unsigned long j;
    float dt,en,ff,fn,fo=0.0;

    sort(n,data);    /* If the data are already sorted into ascending order, then this call can be omitted. */
    en=n;
    *d=0.0;
    for (j=1;j<=n;j++) {    /* Loop over the sorted data points. */
        fn=j/en;    /* Data's c.d.f. after this step. */
        ff=(*func)(data[j]);    /* Compare to the user-supplied function. */
        dt=FMAX(fabs(fo-ff),fabs(fn-ff));    /* Maximum distance. */
        if (dt > *d) *d=dt;
        fo=fn;
    }
    en=sqrt(en);
    *prob=probks((en+0.12+0.11/en)*(*d));    /* Compute significance. */
}

#include <math.h>

void kstwo(float data1[], unsigned long n1, float data2[], unsigned long n2,
    float *d, float *prob)
/* Given an array data1[1..n1], and an array data2[1..n2], this routine returns the K-S
statistic d, and the significance level prob for the null hypothesis that the data sets are
drawn from the same distribution. Small values of prob show that the cumulative distribution
function of data1 is significantly different from that of data2. The arrays data1 and data2
are modified by being sorted into ascending order. */
{
    float probks(float alam);
    void sort(unsigned long n, float arr[]);
    unsigned long j1=1,j2=1;
    float d1,d2,dt,en1,en2,en,fn1=0.0,fn2=0.0;

    sort(n1,data1);
    sort(n2,data2);
    en1=n1;
    en2=n2;
    *d=0.0;
    while (j1 <= n1 && j2 <= n2) {    /* If we are not done... */
        if ((d1=data1[j1]) <= (d2=data2[j2])) fn1=j1++/en1;    /* Next step is in data1. */
        if (d2 <= d1) fn2=j2++/en2;    /* Next step is in data2. */
        if ((dt=fabs(fn2-fn1)) > *d) *d=dt;
    }
    en=sqrt(en1*en2/(en1+en2));
    *prob=probks((en+0.12+0.11/en)*(*d));    /* Compute significance. */
}


Both of the above routines use the following routine for calculating the function
$Q_{KS}$:

#include <math.h>
#define EPS1 0.001
#define EPS2 1.0e-8

float probks(float alam)
/* Kolmogorov-Smirnov probability function. */
{
    int j;
    float a2,fac=2.0,sum=0.0,term,termbf=0.0;

    a2 = -2.0*alam*alam;
    for (j=1;j<=100;j++) {
        term=fac*exp(a2*j*j);
        sum += term;
        if (fabs(term) <= EPS1*termbf || fabs(term) <= EPS2*sum) return sum;
        fac = -fac;    /* Alternating signs in sum. */
        termbf=fabs(term);
    }
    return 1.0;    /* Get here only by failing to converge. */
}


Variants on the K-S Test 

The sensitivity of the K-S test to deviations from a cumulative distribution function 
P(x) is not independent of x. In fact, the K-S test tends to be most sensitive around the 
median value, where P(x) = 0.5, and less sensitive at the extreme ends of the distribution, 
where P(x) is near 0 or 1. The reason is that the difference $|S_N(x) - P(x)|$ does not, in the
null hypothesis, have a probability distribution that is independent of x. Rather, its variance is
proportional to P(x)[1 − P(x)], which is largest at P = 0.5. Since the K-S statistic (14.3.5)
is the maximum difference over all x of two cumulative distribution functions, a deviation that 
might be statistically significant at its own value of x gets compared to the expected chance 
deviation at P = 0.5, and is thus discounted. A result is that, while the K-S test is good at 
finding shifts in a probability distribution, especially changes in the median value, it is not 
always so good at finding spreads, which more affect the tails of the probability distribution, 
and which may leave the median unchanged. 

One way of increasing the power of the K-S statistic out on the tails is to replace
D (equation 14.3.5) by a so-called stabilized or weighted statistic [2-4], for example the
Anderson-Darling statistic,

$$D^{*} = \max_{-\infty < x < \infty} \frac{\left| S_N(x) - P(x) \right|}{\sqrt{P(x)[1 - P(x)]}} \qquad (14.3.11)$$

Unfortunately, there is no simple formula analogous to equations (14.3.7) and (14.3.9) for this
statistic, although Noe [5] gives a computational method using a recursion relation and provides
a graph of numerical results. There are many other possible similar statistics, for example

$$D^{**} = \int_{P=0}^{1} \frac{[S_N(x) - P(x)]^2}{P(x)[1 - P(x)]}\, dP(x) \qquad (14.3.12)$$

which is also discussed by Anderson and Darling (see [3]).

Another approach, which we prefer as simpler and more direct, is due to Kuiper [6,7].
We already mentioned that the standard K-S test is invariant under reparametrizations of the
variable x. An even more general symmetry, which guarantees equal sensitivities at all values
of x, is to wrap the x axis around into a circle (identifying the points at ±∞), and to look for
a statistic that is now invariant under all shifts and parametrizations on the circle. This allows,
for example, a probability distribution to be “cut” at some central value of x, and the left and
right halves to be interchanged, without altering the statistic or its significance.

Kuiper’s statistic, defined as

$$V = D_{+} + D_{-} = \max_{-\infty < x < \infty} [S_N(x) - P(x)] + \max_{-\infty < x < \infty} [P(x) - S_N(x)] \qquad (14.3.13)$$

is the sum of the maximum distance of $S_N(x)$ above and below P(x). You should be able
to convince yourself that this statistic has the desired invariance on the circle: Sketch the
indefinite integral of two probability distributions defined on the circle as a function of angle
around the circle, as the angle goes through several times 360°. If you change the starting
point of the integration, $D_{+}$ and $D_{-}$ change individually, but their sum is constant.

Furthermore, there is a simple formula for the asymptotic distribution of the statistic V,
directly analogous to equations (14.3.7)-(14.3.10). Let

$$Q_{KP}(\lambda) = 2 \sum_{j=1}^{\infty} (4 j^2 \lambda^2 - 1)\, e^{-2 j^2 \lambda^2} \qquad (14.3.14)$$

which is monotonic and satisfies

$$Q_{KP}(0) = 1 \qquad Q_{KP}(\infty) = 0 \qquad (14.3.15)$$

In terms of this function the significance level is [1]

$$\mathrm{Probability}(V > \mathrm{observed}) = Q_{KP}\left( \left[ \sqrt{N_e} + 0.155 + 0.24/\sqrt{N_e} \right] V \right) \qquad (14.3.16)$$

Here $N_e$ is N in the one-sample case, or is given by equation (14.3.10) in the case of
two samples.

Of course, Kuiper’s test is ideal for any problem originally defined on a circle, for 
example, to test whether the distribution in longitude of something agrees with some theory, 
or whether two somethings have different distributions in longitude. (See also [8].) 

We will leave to you the coding of routines analogous to ksone, kstwo, and probks,
above. (For λ < 0.4, don’t try to do the sum 14.3.14. Its value is 1, to 7 figures, but the series
can require many terms to converge, and loses accuracy to roundoff.)

Two final cautionary notes: First, we should mention that all varieties of K-S test lack 
the ability to discriminate some kinds of distributions. A simple example is a probability 
distribution with a narrow “notch” within which the probability falls to zero. Such a 
distribution is of course ruled out by the existence of even one data point within the notch, 
but, because of its cumulative nature, a K-S test would require many data points in the notch 
before signaling a discrepancy. 

Second, we should note that, if you estimate any parameters from a data set (e.g., a mean 
and variance), then the distribution of the K-S statistic D for a cumulative distribution function 
P(x) that uses the estimated parameters is no longer given by equation (14.3.9). In general, 
you will have to determine the new distribution yourself, e.g., by Monte Carlo methods.

14.4 Contingency Table Analysis of Two Distributions

In this section, and the next two sections, we deal with measures of association 
for two distributions. The situation is this: Each data point has two or more different 
quantities associated with it, and we want to know whether knowledge of one quantity 
gives us any demonstrable advantage in predicting the value of another quantity. In 
many cases, one variable will be an “independent” or “control” variable, and another 
will be a “dependent” or “measured” variable. Then, we want to know if the latter 
variable is in fact dependent on or associated with the former variable. If it is, we 
want to have some quantitative measure of the strength of the association. One often 
hears this loosely stated as the question of whether two variables are correlated or
uncorrelated, but we will reserve those terms for a particular kind of association
(linear, or at least monotonic), as discussed in §14.5 and §14.6.

Notice that, as in previous sections, the different concepts of significance and 
strength appear: The association between two distributions may be very significant 
even if that association is weak — if the quantity of data is large enough. 

It is useful to distinguish among some different kinds of variables, with different 
categories forming a loose hierarchy. 

• A variable is called nominal if its values are the members of some unordered 
set. For example, “state of residence” is a nominal variable that (in the 
U.S.) takes on one of 50 values; in astrophysics, “type of galaxy” is a 
nominal variable with the three values “spiral,” “elliptical,” and “irregular.” 

• A variable is termed ordinal if its values are the members of a discrete, but 
ordered, set. Examples are: grade in school, planetary order from the Sun 
(Mercury = 1, Venus = 2, ...), number of offspring. There need not be 
any concept of “equal metric distance” between the values of an ordinal 
variable, only that they be intrinsically ordered. 

• We will call a variable continuous if its values are real numbers, as 
are times, distances, temperatures, etc. (Social scientists sometimes 
distinguish between interval and ratio continuous variables, but we do not 
find that distinction very compelling.) 

Figure 14.4.1. Example of a contingency table for two nominal variables, here sex and color. The row
and column marginals (totals) are shown. The variables are “nominal,” i.e., the order in which their values
are listed is arbitrary and does not affect the result of the contingency table analysis. If the ordering
of values has some intrinsic meaning, then the variables are “ordinal” or “continuous,” and correlation
techniques (§14.5-§14.6) can be utilized.

A continuous variable can always be made into an ordinal one by binning it
into ranges. If we choose to ignore the ordering of the bins, then we can turn it into
a nominal variable. Nominal variables constitute the lowest type of the hierarchy,
and therefore the most general. For example, a set of several continuous or ordinal
variables can be turned, if crudely, into a single nominal variable, by coarsely
binning each variable and then taking each distinct combination of bin assignments
as a single nominal value. When multidimensional data are sparse, this is often
the only sensible way to proceed.
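
The reduction just described can be sketched in a few lines. The function names, the equal-width bins, and the clamping of out-of-range values are illustrative assumptions; the text does not prescribe a binning scheme.

```c
#include <stddef.h>

/* Equal-width bin index in 0..nbins-1 for x in [lo, hi];
   values outside the range are clamped to the end bins. */
size_t bin_index(double x, double lo, double hi, size_t nbins)
{
    double t = (x - lo) / (hi - lo) * (double)nbins;

    if (t < 0.0) return 0;
    if (t >= (double)nbins) return nbins - 1;
    return (size_t)t;
}

/* One nominal label per distinct pair of bin assignments.
   The numeric order of the labels carries no meaning. */
size_t nominal_label(size_t ix, size_t iy, size_t nbins_y)
{
    return ix * nbins_y + iy;
}
```

With three bins in y, for example, the pair of bins (1, 2) maps to the label 5, and every distinct pair receives its own label.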

The remainder of this section will deal with measures of association between 
nominal variables. For any pair of nominal variables, the data can be displayed as 
a contingency table, a table whose rows are labeled by the values of one nominal 
variable, whose columns are labeled by the values of the other nominal variable, 
and whose entries are nonnegative integers giving the number of observed events 
for each combination of row and column (see Figure 14.4.1). The analysis of 
association between nominal variables is thus called contingency table analysis or 
crosstabulation analysis. 

We will introduce two different approaches. The first approach, based on the 
chi-square statistic, does a good job of characterizing the significance of association, 
but is only so-so as a measure of the strength (principally because its numerical 
values have no very direct interpretations). The second approach, based on the 
information-theoretic concept of entropy, says nothing at all about the significance of 
association (use chi-square for that!), but is capable of very elegantly characterizing 
the strength of an association already known to be significant. 

Measures of Association Based on Chi-Square 



Some notation first: Let $N_{ij}$ denote the number of events that occur with the
first variable x taking on its ith value, and the second variable y taking on its jth
value. Let N denote the total number of events, the sum of all the $N_{ij}$’s. Let $N_{i\cdot}$
denote the number of events for which the first variable x takes on its ith value
regardless of the value of y; $N_{\cdot j}$ is the number of events with the jth value of y
regardless of x. So we have

$$N_{i\cdot} = \sum_j N_{ij} \qquad N_{\cdot j} = \sum_i N_{ij}$$

$$N = \sum_i N_{i\cdot} = \sum_j N_{\cdot j} \qquad (14.4.1)$$


$N_{\cdot j}$ and $N_{i\cdot}$ are sometimes called the row and column totals or marginals, but we
will use these terms cautiously since we can never keep straight which are the rows 
and which are the columns! 

The null hypothesis is that the two variables x and y have no association. In this 
case, the probability of a particular value of x given a particular value of y should 
be the same as the probability of that value of x regardless of y. Therefore, in the 
null hypothesis, the expected number for any $N_{ij}$, which we will denote $n_{ij}$, can be
calculated from only the row and column totals,

$$\frac{n_{ij}}{N_{\cdot j}} = \frac{N_{i\cdot}}{N} \quad \text{which implies} \quad n_{ij} = \frac{N_{i\cdot} N_{\cdot j}}{N} \qquad (14.4.2)$$

Notice that if a column or row total is zero, then the expected number for all the 
entries in that column or row is also zero; in that case, the never-occurring bin of x 
or y should simply be removed from the analysis. 

The chi-square statistic is now given by equation (14.3.1), which, in the present
case, is summed over all entries in the table,

$$\chi^2 = \sum_{i,j} \frac{(N_{ij} - n_{ij})^2}{n_{ij}} \qquad (14.4.3)$$


The number of degrees of freedom is equal to the number of entries in the table
(product of its row size and column size) minus the number of constraints that have
arisen from our use of the data themselves to determine the $n_{ij}$. Each row total and
column total is a constraint, except that this overcounts by one, since the total of the
column totals and the total of the row totals both equal N, the total number of data
points. Therefore, if the table is of size I by J, the number of degrees of freedom is
IJ − I − J + 1. Equation (14.4.3), along with the chi-square probability function
(§6.2), now gives the significance of an association between the variables x and y.

Suppose there is a significant association. How do we quantify its strength, so
that (e.g.) we can compare the strength of one association with another? The idea
here is to find some reparametrization of $\chi^2$ which maps it into some convenient
interval, like 0 to 1, where the result is not dependent on the quantity of data that we
happen to sample, but rather depends only on the underlying population from which
the data were drawn. There are several different ways of doing this. Two of the more
common are called Cramer’s V and the contingency coefficient C.







The formula for Cramer’s V is

$$V = \sqrt{\frac{\chi^2}{N \min(I-1,\, J-1)}} \qquad (14.4.4)$$

where I and J are again the numbers of rows and columns, and N is the total 
number of events. Cramer’s V has the pleasant property that it lies between zero 
and one inclusive, equals zero when there is no association, and equals one only 
when the association is perfect: All the events in any row lie in one unique column, 
and vice versa. (In chess parlance, no two rooks, placed on a nonzero table entry, 
can capture each other.) 

In the case of I = J = 2, Cramer’s V is also referred to as the phi statistic.
The contingency coefficient C is defined as

$$C = \sqrt{\frac{\chi^2}{\chi^2 + N}} \qquad (14.4.5)$$


It also lies between zero and one, but (as is apparent from the formula) it can never 
achieve the upper limit. While it can be used to compare the strength of association 
of two tables with the same I and J, its upper limit depends on I and J. Therefore 
it can never be used to compare tables of different sizes. 

The trouble with both Cramer’s V and the contingency coefficient C is that, when
they take on values in between their extremes, there is no very direct interpretation 
of what that value means. For example, you are in Las Vegas, and a friend tells you 
that there is a small, but significant, association between the color of a croupier’s 
eyes and the occurrence of red and black on his roulette wheel. Cramer’s V is about 
0.028, your friend tells you. You know what the usual odds against you are (because 
of the green zero and double zero on the wheel). Is this association sufficient for 
you to make money? Don’t ask us! 

#include <math.h>
#include "nrutil.h"
#define TINY 1.0e-30    /* A small number. */

void cntab1(int **nn, int ni, int nj, float *chisq, float *df, float *prob,
    float *cramrv, float *ccc)
/* Given a two-dimensional contingency table in the form of an integer array nn[1..ni][1..nj],
this routine returns the chi-square chisq, the number of degrees of freedom df, the significance
level prob (small values indicating a significant association), and two measures of association,
Cramer's V (cramrv) and the contingency coefficient C (ccc). */
{
    float gammq(float a, float x);
    int nnj,nni,j,i,minij;
    float sum=0.0,expctd,*sumi,*sumj,temp;

    sumi=vector(1,ni);
    sumj=vector(1,nj);
    nni=ni;    /* Number of rows */
    nnj=nj;    /* and columns. */
    for (i=1;i<=ni;i++) {    /* Get the row totals. */
        sumi[i]=0.0;
        for (j=1;j<=nj;j++) {
            sumi[i] += nn[i][j];
            sum += nn[i][j];
        }
        if (sumi[i] == 0.0) --nni;    /* Eliminate any zero rows by reducing the number. */
    }
    for (j=1;j<=nj;j++) {    /* Get the column totals. */
        sumj[j]=0.0;
        for (i=1;i<=ni;i++) sumj[j] += nn[i][j];
        if (sumj[j] == 0.0) --nnj;    /* Eliminate any zero columns. */
    }
    *df=nni*nnj-nni-nnj+1;    /* Corrected number of degrees of freedom. */
    *chisq=0.0;
    for (i=1;i<=ni;i++) {    /* Do the chi-square sum. */
        for (j=1;j<=nj;j++) {
            expctd=sumj[j]*sumi[i]/sum;
            temp=nn[i][j]-expctd;
            *chisq += temp*temp/(expctd+TINY);    /* Here TINY guarantees that any eliminated
                                                     row or column will not contribute to the sum. */
        }
    }
    *prob=gammq(0.5*(*df),0.5*(*chisq));    /* Chi-square probability function. */
    minij = nni < nnj ? nni-1 : nnj-1;
    *cramrv=sqrt(*chisq/(sum*minij));
    *ccc=sqrt(*chisq/(*chisq+sum));
    free_vector(sumj,1,nj);
    free_vector(sumi,1,ni);
}


Measures of Association Based on Entropy 

Consider the game of “twenty questions,” where by repeated yes/no questions 
you try to eliminate all except one correct possibility for an unknown object. Better 
yet, consider a generalization of the game, where you are allowed to ask multiple 
choice questions as well as binary (yes/no) ones. The categories in your multiple 
choice questions are supposed to be mutually exclusive and exhaustive (as are “yes” 
and “no”). 

The value to you of an answer increases with the number of possibilities that 
it eliminates. More specifically, an answer that eliminates all except a fraction p of 
the remaining possibilities can be assigned a value $-\ln p$ (a positive number, since
p < 1). The purpose of the logarithm is to make the value additive, since (e.g.) one 
question that eliminates all but 1/6 of the possibilities is considered as good as two 
questions that, in sequence, reduce the number by factors 1/2 and 1/3. 

So that is the value of an answer; but what is the value of a question? If there
are I possible answers to the question (i = 1,...,I) and the fraction of possibilities
consistent with the ith answer is $p_i$ (with the sum of the $p_i$’s equal to one), then the
value of the question is the expectation value of the value of the answer, denoted H,

$$H = -\sum_{i=1}^{I} p_i \ln p_i \qquad (14.4.6)$$

In evaluating (14.4.6), note that

$$\lim_{p \to 0} p \ln p = 0 \qquad (14.4.7)$$

The value H lies between 0 and ln I. It is zero only when one of the $p_i$’s is one, all
the others zero: In this case, the question is valueless, since its answer is preordained.
H takes on its maximum value when all the $p_i$’s are equal, in which case the question
is sure to eliminate all but a fraction 1/I of the remaining possibilities.

The value H is conventionally termed the entropy of the distribution given by
the $p_i$’s, a terminology borrowed from statistical physics.

So far we have said nothing about the association of two variables; but suppose 
we are deciding what question to ask next in the game and have to choose between 
two candidates, or possibly want to ask both in one order or another. Suppose that 
one question, x, has I possible answers, labeled by i, and that the other question,
y, has J possible answers, labeled by j. Then the possible outcomes of asking both
questions form a contingency table whose entries $N_{ij}$, when normalized by dividing
by the total number of remaining possibilities N, give all the information about the
p’s. In particular, we can make contact with the notation (14.4.1) by identifying

$$p_{ij} = \frac{N_{ij}}{N}$$

$$p_{i\cdot} = \frac{N_{i\cdot}}{N} \quad \text{(outcomes of question x alone)} \qquad (14.4.8)$$

$$p_{\cdot j} = \frac{N_{\cdot j}}{N} \quad \text{(outcomes of question y alone)}$$

The entropies of the questions x and y are, respectively,

$$H(x) = -\sum_i p_{i\cdot} \ln p_{i\cdot} \qquad H(y) = -\sum_j p_{\cdot j} \ln p_{\cdot j} \qquad (14.4.9)$$

The entropy of the two questions together is

$$H(x,y) = -\sum_{i,j} p_{ij} \ln p_{ij} \qquad (14.4.10)$$


Now what is the entropy of the question y given x (that is, if x is asked first)?
It is the expectation value over the answers to x of the entropy of the restricted
y distribution that lies in a single column of the contingency table (corresponding
to the x answer):

$$H(y|x) = -\sum_i p_{i\cdot} \sum_j \frac{p_{ij}}{p_{i\cdot}} \ln \frac{p_{ij}}{p_{i\cdot}} = -\sum_{i,j} p_{ij} \ln \frac{p_{ij}}{p_{i\cdot}} \qquad (14.4.11)$$


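Correspondingly, the entropy of x given y is

$$H(x|y) = -\sum_j p_{\cdot j} \sum_i \frac{p_{ij}}{p_{\cdot j}} \ln \frac{p_{ij}}{p_{\cdot j}} = -\sum_{i,j} p_{ij} \ln \frac{p_{ij}}{p_{\cdot j}} \qquad (14.4.12)$$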



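We can readily prove that the entropy of y given x is never more than the
entropy of y alone, i.e., that asking x first can only reduce the usefulness of asking
y (in which case the two variables are associated!):

$$\begin{aligned}
H(y|x) - H(y) &= -\sum_{i,j} p_{ij} \ln \frac{p_{ij}/p_{i\cdot}}{p_{\cdot j}} \\
&= \sum_{i,j} p_{ij} \ln \frac{p_{\cdot j}\, p_{i\cdot}}{p_{ij}} \\
&\le \sum_{i,j} p_{ij} \left( \frac{p_{\cdot j}\, p_{i\cdot}}{p_{ij}} - 1 \right) \\
&= \sum_{i,j} p_{i\cdot}\, p_{\cdot j} - \sum_{i,j} p_{ij} \\
&= 1 - 1 = 0
\end{aligned} \qquad (14.4.13)$$

where the inequality follows from the fact

$$\ln w \le w - 1 \qquad (14.4.14)$$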

We now have everything we need to define a measure of the “dependency” of y
on x, that is to say a measure of association. This measure is sometimes called the
uncertainty coefficient of y. We will denote it as U(y|x),

$$U(y|x) \equiv \frac{H(y) - H(y|x)}{H(y)} \qquad (14.4.15)$$

This measure lies between zero and one, with the value 0 indicating that x and y
have no association, the value 1 indicating that knowledge of x completely predicts
y. For in-between values, U(y|x) gives the fraction of y’s entropy H(y) that is
lost if x is already known (i.e., that is redundant with the information in x). In our
game of “twenty questions,” U(y|x) is the fractional loss in the utility of question
y if question x is to be asked first.

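If we wish to view x as the dependent variable, y as the independent one, then
interchanging x and y we can of course define the dependency of x on y,

$$U(x|y) \equiv \frac{H(x) - H(x|y)}{H(x)} \qquad (14.4.16)$$

If we want to treat x and y symmetrically, then the useful combination turns
out to be

$$U(x,y) \equiv 2 \left[ \frac{H(y) + H(x) - H(x,y)}{H(x) + H(y)} \right] \qquad (14.4.17)$$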

If the two variables are completely independent, then H(x, y) = H(x) + H(y), so
(14.4.17) vanishes. If the two variables are completely dependent, then H(x) =
H(y) = H(x, y), so (14.4.17) equals unity. In fact, you can use the identities (easily
proved from equations 14.4.9-14.4.12)

$$H(x,y) = H(x) + H(y|x) = H(y) + H(x|y) \qquad (14.4.18)$$

to show that

$$U(x,y) = \frac{H(x)\,U(x|y) + H(y)\,U(y|x)}{H(x) + H(y)} \qquad (14.4.19)$$

i.e., that the symmetrical measure is just a weighted average of the two asymmetrical
measures (14.4.15) and (14.4.16), weighted by the entropy of each variable separately.

Here is a program for computing all the quantities discussed, H(x), H(y),
H(x|y), H(y|x), H(x,y), U(x|y), U(y|x), and U(x,y):

#include <math.h>
#include "nrutil.h"
#define TINY 1.0e-30    /* A small number. */

void cntab2(int **nn, int ni, int nj, float *h, float *hx, float *hy,
    float *hygx, float *hxgy, float *uygx, float *uxgy, float *uxy)
/* Given a two-dimensional contingency table in the form of an integer array nn[i][j], where i
labels the x variable and ranges from 1 to ni, j labels the y variable and ranges from 1 to nj,
this routine returns the entropy h of the whole table, the entropy hx of the x distribution, the
entropy hy of the y distribution, the entropy hygx of y given x, the entropy hxgy of x given y,
the dependency uygx of y on x (eq. 14.4.15), the dependency uxgy of x on y (eq. 14.4.16),
and the symmetrical dependency uxy (eq. 14.4.17). */
{
    int i,j;
    float sum=0.0,p,*sumi,*sumj;

    sumi=vector(1,ni);
    sumj=vector(1,nj);
    for (i=1;i<=ni;i++) {    /* Get the row totals. */
        sumi[i]=0.0;
        for (j=1;j<=nj;j++) {
            sumi[i] += nn[i][j];
            sum += nn[i][j];
        }
    }
    for (j=1;j<=nj;j++) {    /* Get the column totals. */
        sumj[j]=0.0;
        for (i=1;i<=ni;i++)
            sumj[j] += nn[i][j];
    }
    *hx=0.0;    /* Entropy of the x distribution, */
    for (i=1;i<=ni;i++)
        if (sumi[i]) {
            p=sumi[i]/sum;
            *hx -= p*log(p);
        }
    *hy=0.0;    /* and of the y distribution. */
    for (j=1;j<=nj;j++)
        if (sumj[j]) {
            p=sumj[j]/sum;
            *hy -= p*log(p);
        }
    *h=0.0;
    for (i=1;i<=ni;i++)    /* Total entropy: loop over both x */
        for (j=1;j<=nj;j++)    /* and y. */
            if (nn[i][j]) {
                p=nn[i][j]/sum;
                *h -= p*log(p);
            }
    *hygx=(*h)-(*hx);    /* Uses equation (14.4.18), */
    *hxgy=(*h)-(*hy);    /* as does this. */
    *uygx=(*hy-*hygx)/(*hy+TINY);    /* Equation (14.4.15). */
    *uxgy=(*hx-*hxgy)/(*hx+TINY);    /* Equation (14.4.16). */
    *uxy=2.0*(*hx+*hy-*h)/(*hx+*hy+TINY);    /* Equation (14.4.17). */
    free_vector(sumj,1,nj);
    free_vector(sumi,1,ni);
}


CITED REFERENCES AND FURTHER READING: 

Dunn, O.J., and Clark, V.A. 1974, Applied Statistics: Analysis of Variance and Regression (New
York: Wiley).

Norusis, M.J. 1982, SPSS Introductory Guide: Basic Statistics and Operations; and 1985, SPSS-X
Advanced Statistics Guide (New York: McGraw-Hill).

Fano, R.M. 1961, Transmission of Information (New York: Wiley and MIT Press), Chapter 2.


14.5 Linear Correlation 


We next turn to measures of association between variables that are ordinal
or continuous, rather than nominal. Most widely used is the linear correlation
coefficient. For pairs of quantities $(x_i, y_i)$, i = 1,..., N, the linear correlation
coefficient r (also called the product-moment correlation coefficient, or Pearson’s
r) is given by the formula

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\, \sqrt{\sum_i (y_i - \bar{y})^2}} \qquad (14.5.1)$$

where, as usual, $\bar{x}$ is the mean of the $x_i$’s, $\bar{y}$ is the mean of the $y_i$’s.

The value of r lies between —1 and 1, inclusive. It takes on a value of 1, termed 
“complete positive correlation,” when the data points lie on a perfect straight line 
with positive slope, with x and y increasing together. The value 1 holds independent 
of the magnitude of the slope. If the data points lie on a perfect straight line with 
negative slope, y decreasing as x increases, then r has the value —1; this is called 
“complete negative correlation.” A value of r near zero indicates that the variables 
x and y are uncorrelated. 

When a correlation is known to be significant, r is one conventional way of 
summarizing its strength. In fact, the value of r can be translated into a statement 
about what residuals (root mean square deviations) are to be expected if the data are 
fitted to a straight line by the least-squares method (see §15.2, especially equations 
15.2.13 - 15.2.14). Unfortunately, r is a rather poor statistic for deciding whether 
an observed correlation is statistically significant, and/or whether one observed 
correlation is significantly stronger than another. The reason is that r is ignorant of 
the individual distributions of x and y, so there is no universal way to compute its 
distribution in the case of the null hypothesis. 

About the only general statement that can be made is this: If the null hypothesis 
is that x and y are uncorrelated, and if the distributions for x and y each have 
enough convergent moments (“tails” die off sufficiently rapidly), and if N is large 
(typically > 500), then r is distributed approximately normally, with a mean of zero 
and a standard deviation of $1/\sqrt{N}$. In that case, the (double-sided) significance of
the correlation, that is, the probability that |r| should be larger than its observed
value in the null hypothesis, is

$$\mathrm{erfc}\left( \frac{|r| \sqrt{N}}{\sqrt{2}} \right) \qquad (14.5.2)$$

where erfc(x) is the complementary error function, equation (6.2.8), computed by
the routines erffc or erfcc of §6.2. A small value of (14.5.2) indicates that the
two distributions are significantly correlated. (See expression 14.5.9 below for a
more accurate test.)

Most statistics books try to go beyond (14.5.2) and give additional statistical 
tests that can be made using r. In almost all cases, however, these tests are valid 
only for a very special class of hypotheses, namely that the distributions of x and y 
jointly form a binormal or two-dimensional Gaussian distribution around their mean 
values, with joint probability density 


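$$p(x,y)\,dx\,dy = \mathrm{const.} \times \exp\left[-\frac{1}{2}\left(a_{11}x^2 - 2a_{12}xy + a_{22}y^2\right)\right] dx\,dy \tag{14.5.3}$$

where $a_{11}$, $a_{12}$, and $a_{22}$ are arbitrary constants. For this distribution r has the value 

$$r = \frac{a_{12}}{\sqrt{a_{11}a_{22}}} \tag{14.5.4}$$
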
There are occasions when (14.5.3) may be known to be a good model of the 
data. There may be other occasions when we are willing to take (14.5.3) as at least 
a rough and ready guess, since many two-dimensional distributions do resemble a 
binormal distribution, at least not too far out on their tails. In either situation, we can 
use (14.5.3) to go beyond (14.5.2) in any of several directions: 

First, we can allow for the possibility that the number N of data points is not 
large. Here, it turns out that the statistic 

$$t = r\sqrt{\frac{N-2}{1-r^2}} \tag{14.5.5}$$

is distributed in the null case (of no correlation) like Student’s t-distribution with 
$\nu = N - 2$ degrees of freedom, whose two-sided significance level is given by 
$1 - A(t|\nu)$ (equation 6.4.7). As N becomes large, this significance and (14.5.2) 
become asymptotically the same, so that one never does worse by using (14.5.5), 
even if the binormal assumption is not well substantiated. 

Second, when N is only moderately large (> 10), we can compare whether 
the difference of two significantly nonzero r’s, e.g., from different experiments, is 
itself significant. In other words, we can quantify whether a change in some control 
variable significantly alters an existing correlation between two other variables. This 
is done by using Fisher’s z-transformation to associate each measured r with a 
corresponding z, 

$$z = \frac{1}{2}\ln\left(\frac{1+r}{1-r}\right) \tag{14.5.6}$$

Then, each z is approximately normally distributed with a mean value 

$$\bar{z} = \frac{1}{2}\left[\ln\left(\frac{1+r_{\rm true}}{1-r_{\rm true}}\right) + \frac{r_{\rm true}}{N-1}\right] \tag{14.5.7}$$

where $r_{\rm true}$ is the actual or population value of the correlation coefficient, and with 
a standard deviation 

$$\sigma(z) \approx \frac{1}{\sqrt{N-3}} \tag{14.5.8}$$

Equations (14.5.7) and (14.5.8), when they are valid, give several useful 
statistical tests. For example, the significance level at which a measured value of r 
differs from some hypothesized value $r_{\rm true}$ is given by 

$$\mathrm{erfc}\left(\frac{|z-\bar{z}|\sqrt{N-3}}{\sqrt{2}}\right) \tag{14.5.9}$$

where z and $\bar{z}$ are given by (14.5.6) and (14.5.7), with small values of (14.5.9) 
indicating a significant difference. (Setting $\bar{z} = 0$ makes expression 14.5.9 a more 
accurate replacement for expression 14.5.2 above.) Similarly, the significance of a 
difference between two measured correlation coefficients $r_1$ and $r_2$ is 

$$\mathrm{erfc}\left(\frac{|z_1-z_2|}{\sqrt{2}\,\sqrt{\frac{1}{N_1-3}+\frac{1}{N_2-3}}}\right) \tag{14.5.10}$$

where $z_1$ and $z_2$ are obtained from $r_1$ and $r_2$ using (14.5.6), and where $N_1$ and $N_2$ 
are, respectively, the number of data points in the measurement of $r_1$ and $r_2$. 

All of the significances above are two-sided. If you wish to disprove the null 
hypothesis in favor of a one-sided hypothesis, such as that $r_1 > r_2$ (where the sense 
of the inequality was decided a priori), then (i) if your measured $r_1$ and $r_2$ have 
the wrong sense, you have failed to demonstrate your one-sided hypothesis, but (ii) 
if they have the right ordering, you can multiply the significances given above by 
0.5, which makes them more significant. 

But keep in mind: These interpretations of the r statistic can be completely 
meaningless if the joint probability distribution of your variables x and y is too 
different from a binormal distribution. 

#include <math.h>
#define TINY 1.0e-20    /* Will regularize the unusual case of complete correlation. */

void pearsn(float x[], float y[], unsigned long n, float *r, float *prob,
    float *z)
/* Given two arrays x[1..n] and y[1..n], this routine computes their correlation coefficient
r (returned as r), the significance level at which the null hypothesis of zero correlation is
disproved (prob whose small value indicates a significant correlation), and Fisher's z (returned
as z), whose value can be used in further statistical tests as described above. */
{
    float betai(float a, float b, float x);
    float erfcc(float x);
    unsigned long j;
    float yt,xt,t,df;
    float syy=0.0,sxy=0.0,sxx=0.0,ay=0.0,ax=0.0;

    for (j=1;j<=n;j++) {    /* Find the means. */
        ax += x[j];
        ay += y[j];
    }
    ax /= n;
    ay /= n;
    for (j=1;j<=n;j++) {    /* Compute the correlation coefficient. */
        xt=x[j]-ax;
        yt=y[j]-ay;
        sxx += xt*xt;
        syy += yt*yt;
        sxy += xt*yt;
    }
    *r=sxy/(sqrt(sxx*syy)+TINY);
    *z=0.5*log((1.0+(*r)+TINY)/(1.0-(*r)+TINY));    /* Fisher's z transformation. */
    df=n-2;
    t=(*r)*sqrt(df/((1.0-(*r)+TINY)*(1.0+(*r)+TINY)));    /* Equation (14.5.5). */
    *prob=betai(0.5*df,0.5,df/(df+t*t));    /* Student's t probability. */
    /* *prob=erfcc(fabs((*z)*sqrt(n-1.0))/1.4142136) */
    /* For large n, this easier computation of prob, using the short routine erfcc,
    would give approximately the same value. */
}


CITED REFERENCES AND FURTHER READING: 

Dunn, O.J., and Clark, V.A. 1974, Applied Statistics: Analysis of Variance and Regression (New 
York: Wiley). 

Hoel, P.G. 1971, Introduction to Mathematical Statistics, 4th ed. (New York: Wiley), Chapter 7. 

von Mises, R. 1964, Mathematical Theory of Probability and Statistics (New York: Academic 
Press), Chapters IX(A) and IX(B). 

Korn, G.A., and Korn, T.M. 1968, Mathematical Handbook for Scientists and Engineers, 2nd ed. 
(New York: McGraw-Hill), §19.7. 

Norusis, M.J. 1982, SPSS Introductory Guide: Basic Statistics and Operations; and 1985, SPSS-X 
Advanced Statistics Guide (New York: McGraw-Hill). 


14.6 Nonparametric or Rank Correlation 

It is precisely the uncertainty in interpreting the significance of the linear 
correlation coefficient r that leads us to the important concepts of nonparametric or 
rank correlation. As before, we are given N pairs of measurements $(x_i, y_i)$. Before, 
difficulties arose because we did not necessarily know the probability distribution 
function from which the $x_i$’s or $y_i$’s were drawn. 

The key concept of nonparametric correlation is this: If we replace the value 
of each $x_i$ by the value of its rank among all the other $x_i$’s in the sample, that 
is, 1, 2, 3, ..., N, then the resulting list of numbers will be drawn from a perfectly 
known distribution function, namely uniformly from the integers between 1 and N, 
inclusive. Better than uniformly, in fact, since if the $x_i$’s are all distinct, then each 
integer will occur precisely once. If some of the $x_i$’s have identical values, it is 
conventional to assign to all these “ties” the mean of the ranks that they would have 
had if their values had been slightly different. This midrank will sometimes be an 
integer, sometimes a half-integer. In all cases the sum of all assigned ranks will be 
the same as the sum of the integers from 1 to N, namely $\frac{1}{2}N(N + 1)$. 

Of course we do exactly the same procedure for the $y_i$’s, replacing each value 
by its rank among the other $y_i$’s in the sample. 
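
To make the midrank convention concrete, here is a minimal sketch that ranks an already sorted array. It is an illustration under stated assumptions, not the book's routine: the name midrank_sorted is hypothetical, arrays are zero-based, and the return value (the sum of $f^3 - f$ over groups of f tied values) is included only because such sums appear in the tie corrections for rank statistics.

```c
#include <stddef.h>

/* Replace the entries of an ascending-sorted array w[0..n-1] by their
   ranks 1..n, giving every member of a group of ties the mean of the
   ranks the group occupies (the midrank). Returns the sum over tie
   groups of f*f*f - f, where f is the size of the group; untied
   entries (f = 1) contribute zero. */
double midrank_sorted(double w[], size_t n)
{
    double s = 0.0;
    size_t j = 0;
    while (j < n) {
        size_t jt = j + 1, ji;
        double f, rank;
        while (jt < n && w[jt] == w[j]) jt++;   /* w[j..jt-1] are equal */
        f = (double)(jt - j);
        rank = 0.5 * (double)(j + 1 + jt);      /* mean of ranks j+1 .. jt */
        for (ji = j; ji < jt; ji++) w[ji] = rank;
        s += f * f * f - f;
        j = jt;
    }
    return s;
}
```

Applied to the sorted values 10, 20, 20, 30, this yields the ranks 1, 2.5, 2.5, 4, whose sum is 10, the same as 1 + 2 + 3 + 4.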

Now we are free to invent statistics for detecting correlation between uniform 
sets of integers between 1 and N, keeping in mind the possibility of ties in the ranks. 
There is, of course, some loss of information in replacing the original numbers by 
ranks. We could construct some rather artificial examples where a correlation could 
be detected parametrically (e.g., in the linear correlation coefficient r), but could not 
be detected nonparametrically. Such examples are very rare in real life, however, 
and the slight loss of information in ranking is a small price to pay for a very major 
advantage: When a correlation is demonstrated to be present nonparametrically, 
then it is really there! (That is, to a certainty level that depends on the significance 
chosen.) Nonparametric correlation is more robust than linear correlation, more 
resistant to unplanned defects in the data, in the same sort of sense that the median 
is more robust than the mean. For more on the concept of robustness, see §15.7. 

As always in statistics, some particular choices of a statistic have already been 
invented for us and consecrated, if not beatified, by popular use. We will discuss 
two, the Spearman rank-order correlation coefficient ($r_s$), and Kendall’s tau ($\tau$). 

Spearman Rank-Order Correlation Coefficient 

Let $R_i$ be the rank of $x_i$ among the other x’s, $S_i$ be the rank of $y_i$ among the 
other y’s, ties being assigned the appropriate midrank as described above. Then the 
rank-order correlation coefficient is defined to be the linear correlation coefficient 
of the ranks, namely, 

$$r_s = \frac{\sum_i (R_i - \bar{R})(S_i - \bar{S})}{\sqrt{\sum_i (R_i - \bar{R})^2}\,\sqrt{\sum_i (S_i - \bar{S})^2}} \tag{14.6.1}$$

The significance of a nonzero value of $r_s$ is tested by computing 

$$t = r_s\sqrt{\frac{N-2}{1-r_s^2}} \tag{14.6.2}$$

which is distributed approximately as Student’s distribution with N - 2 degrees of 
freedom. A key point is that this approximation does not depend on the original 
distribution of the x’s and y’s; it is always the same approximation, and always 
pretty good. 

It turns out that $r_s$ is closely related to another conventional measure of 
nonparametric correlation, the so-called sum squared difference of ranks, defined as 

$$D = \sum_{i=1}^{N} (R_i - S_i)^2 \tag{14.6.3}$$

(This D is sometimes denoted D**, where the asterisks are used to indicate that 
ties are treated by midranking.) 

When there are no ties in the data, then the exact relation between D and $r_s$ is 

$$r_s = 1 - \frac{6D}{N^3 - N} \tag{14.6.4}$$

When there are ties, then the exact relation is slightly more complicated: Let $f_k$ be 
the number of ties in the kth group of ties among the $R_i$’s, and let $g_m$ be the number 
of ties in the mth group of ties among the $S_i$’s. Then it turns out that 

$$r_s = \frac{1 - \frac{6}{N^3-N}\left[D + \frac{1}{12}\sum_k (f_k^3 - f_k) + \frac{1}{12}\sum_m (g_m^3 - g_m)\right]}{\left[1 - \frac{\sum_k (f_k^3 - f_k)}{N^3-N}\right]^{1/2}\left[1 - \frac{\sum_m (g_m^3 - g_m)}{N^3-N}\right]^{1/2}} \tag{14.6.5}$$

holds exactly. Notice that if all the $f_k$’s and all the $g_m$’s are equal to one, meaning 
that there are no ties, then equation (14.6.5) reduces to equation (14.6.4). 

In (14.6.2) we gave a t-statistic that tests the significance of a nonzero $r_s$. It is 
also possible to test the significance of D directly. The expectation value of D in 
the null hypothesis of uncorrelated data sets is 

$$\bar{D} = \frac{1}{6}(N^3 - N) - \frac{1}{12}\sum_k (f_k^3 - f_k) - \frac{1}{12}\sum_m (g_m^3 - g_m) \tag{14.6.6}$$

its variance is 

$$\mathrm{Var}(D) = \frac{(N-1)N^2(N+1)^2}{36}\left[1 - \frac{\sum_k (f_k^3 - f_k)}{N^3-N}\right]\left[1 - \frac{\sum_m (g_m^3 - g_m)}{N^3-N}\right] \tag{14.6.7}$$

and it is approximately normally distributed, so that the significance level is a 
complementary error function (cf. equation 14.5.2). Of course, (14.6.2) and (14.6.7) 
are not independent tests, but simply variants of the same test. In the program that 
follows, we calculate both the significance level obtained by using (14.6.2) and the 
significance level obtained by using (14.6.7); their discrepancy will give you an idea 
of how good the approximations are. You will also notice that we break off the task 
of assigning ranks (including tied midranks) into a separate function, crank. 

#include <math.h>
#include "nrutil.h"

void spear(float data1[], float data2[], unsigned long n, float *d, float *zd,
    float *probd, float *rs, float *probrs)
/* Given two data arrays, data1[1..n] and data2[1..n], this routine returns their sum-squared
difference of ranks as D, the number of standard deviations by which D deviates from its null-
hypothesis expected value as zd, the two-sided significance level of this deviation as probd,
Spearman's rank correlation r_s as rs, and the two-sided significance level of its deviation from
zero as probrs. The external routines crank (below) and sort2 (§8.2) are used. A small value
of either probd or probrs indicates a significant correlation (rs positive) or anticorrelation
(rs negative). */
{
    float betai(float a, float b, float x);
    void crank(unsigned long n, float w[], float *s);
    float erfcc(float x);
    void sort2(unsigned long n, float arr[], float brr[]);
    unsigned long j;
    float vard,t,sg,sf,fac,en3n,en,df,aved,*wksp1,*wksp2;

    wksp1=vector(1,n);
    wksp2=vector(1,n);
    for (j=1;j<=n;j++) {
        wksp1[j]=data1[j];
        wksp2[j]=data2[j];
    }
    /* Sort each of the data arrays, and convert the entries to ranks. The values sf
    and sg return the sums of f_k^3 - f_k and g_m^3 - g_m, respectively. */
    sort2(n,wksp1,wksp2);
    crank(n,wksp1,&sf);
    sort2(n,wksp2,wksp1);
    crank(n,wksp2,&sg);
    *d=0.0;
    for (j=1;j<=n;j++)    /* Sum the squared difference of ranks. */
        *d += SQR(wksp1[j]-wksp2[j]);
    en=n;
    en3n=en*en*en-en;
    aved=en3n/6.0-(sf+sg)/12.0;    /* Expectation value of D, */
    fac=(1.0-sf/en3n)*(1.0-sg/en3n);
    vard=((en-1.0)*en*en*SQR(en+1.0)/36.0)*fac;    /* and variance of D give */
    *zd=(*d-aved)/sqrt(vard);    /* number of standard deviations */
    *probd=erfcc(fabs(*zd)/1.4142136);    /* and significance. */
    *rs=(1.0-(6.0/en3n)*(*d+(sf+sg)/12.0))/sqrt(fac);    /* Rank correlation coefficient, */
    fac=(*rs+1.0)*(1.0-(*rs));
    if (fac > 0.0) {
        t=(*rs)*sqrt((en-2.0)/fac);    /* and its t value, */
        df=en-2.0;
        *probrs=betai(0.5*df,0.5,df/(df+t*t));    /* give its significance. */
    } else
        *probrs=0.0;
    free_vector(wksp2,1,n);
    free_vector(wksp1,1,n);
}


void crank(unsigned long n, float w[], float *s)
/* Given a sorted array w[1..n], replaces the elements by their rank, including midranking of ties,
and returns as s the sum of f^3 - f, where f is the number of elements in each tie. */
{
    unsigned long j=1,ji,jt;
    float t,rank;

    *s=0.0;
    while (j < n) {
        if (w[j+1] != w[j]) {    /* Not a tie. */
            w[j]=j;
            ++j;
        } else {    /* A tie: */
            for (jt=j+1;jt<=n && w[jt]==w[j];jt++);    /* How far does it go? */
            rank=0.5*(j+jt-1);    /* This is the mean rank of the tie, */
            for (ji=j;ji<=(jt-1);ji++) w[ji]=rank;    /* so enter it into all the tied entries, */
            t=jt-j;
            *s += t*t*t-t;    /* and update s. */
            j=jt;
        }
    }
    if (j == n) w[n]=n;    /* If the last element was not tied, this is its rank. */
}


Kendall’s Tau 

Kendall’s $\tau$ is even more nonparametric than Spearman’s $r_s$ or D. Instead of 
using the numerical difference of ranks, it uses only the relative ordering of ranks: 
higher in rank, lower in rank, or the same in rank. But in that case we don’t even 
have to rank the data! Ranks will be higher, lower, or the same if and only if 
the values are larger, smaller, or equal, respectively. On balance, we prefer $r_s$ as 
being the more straightforward nonparametric test, but both statistics are in general 
use. In fact, $\tau$ and $r_s$ are very strongly correlated and, in most applications, are 
effectively the same test. 

To define $\tau$, we start with the N data points $(x_i, y_i)$. Now consider all 
$\frac{1}{2}N(N - 1)$ pairs of data points, where a data point cannot be paired with itself, 
and where the points in either order count as one pair. We call a pair concordant 
if the relative ordering of the ranks of the two x’s (or for that matter the two x’s 
themselves) is the same as the relative ordering of the ranks of the two y’s (or for 
that matter the two y’s themselves). We call a pair discordant if the relative ordering 
of the ranks of the two x’s is opposite from the relative ordering of the ranks of the 
two y’s. If there is a tie in either the ranks of the two x’s or the ranks of the two 
y’s, then we don’t call the pair either concordant or discordant. If the tie is in the 
x’s, we will call the pair an “extra y pair.” If the tie is in the y’s, we will call the 
pair an “extra x pair.” If the tie is in both the x’s and the y’s, we don’t call the pair 
anything at all. Are you still with us? 

Kendall’s $\tau$ is now the following simple combination of these various counts: 

$$\tau = \frac{\text{concordant} - \text{discordant}}{\sqrt{\text{concordant} + \text{discordant} + \text{extra-}y}\;\sqrt{\text{concordant} + \text{discordant} + \text{extra-}x}} \tag{14.6.8}$$

You can easily convince yourself that this must lie between 1 and -1, and that it 
takes on the extreme values only for complete rank agreement or complete rank 
reversal, respectively. 

More important, Kendall has worked out, from the combinatorics, the approximate 
distribution of $\tau$ in the null hypothesis of no association between x and y. 
In this case $\tau$ is approximately normally distributed, with zero expectation value 
and a variance of 

$$\mathrm{Var}(\tau) = \frac{4N + 10}{9N(N - 1)} \tag{14.6.9}$$



The following program proceeds according to the above description, and 
therefore loops over all pairs of data points. Beware: This is an $O(N^2)$ algorithm, 
unlike the algorithm for $r_s$, whose dominant sort operations are of order N log N. If 
you are routinely computing Kendall’s $\tau$ for data sets of more than a few thousand 
points, you may be in for some serious computing. If, however, you are willing to 
bin your data into a moderate number of bins, then read on. 


#include <math.h>

void kendl1(float data1[], float data2[], unsigned long n, float *tau,
    float *z, float *prob)
/* Given data arrays data1[1..n] and data2[1..n], this program returns Kendall's tau as tau,
its number of standard deviations from zero as z, and its two-sided significance level as prob.
Small values of prob indicate a significant correlation (tau positive) or anticorrelation (tau
negative). */
{
    float erfcc(float x);
    unsigned long n2=0,n1=0,k,j;
    long is=0;
    float svar,aa,a2,a1;

    for (j=1;j<n;j++) {    /* Loop over first member of pair, */
        for (k=(j+1);k<=n;k++) {    /* and second member. */
            a1=data1[j]-data1[k];
            a2=data2[j]-data2[k];
            aa=a1*a2;
            if (aa) {    /* Neither array has a tie. */
                ++n1;
                ++n2;
                aa > 0.0 ? ++is : --is;
            } else {    /* One or both arrays have ties. */
                if (a1) ++n1;    /* An "extra x" event. */
                if (a2) ++n2;    /* An "extra y" event. */
            }
        }
    }
    *tau=is/(sqrt((double) n1)*sqrt((double) n2));    /* Equation (14.6.8). */
    svar=(4.0*n+10.0)/(9.0*n*(n-1.0));    /* Equation (14.6.9). */
    *z=(*tau)/sqrt(svar);
    *prob=erfcc(fabs(*z)/1.4142136);    /* Significance. */
}


Sometimes it happens that there are only a few possible values each for x and 
y. In that case, the data can be recorded as a contingency table (see § 14.4) that gives 
the number of data points for each contingency of x and y. 

Spearman’s rank-order correlation coefficient is not a very natural statistic 
under these circumstances, since it assigns to each x and y bin a not-very-meaningful 
midrank value and then totals up vast numbers of identical rank differences. Kendall’s 
tau, on the other hand, with its simple counting, remains quite natural. Furthermore, 
its $O(N^2)$ algorithm is no longer a problem, since we can arrange for it to loop over 
pairs of contingency table entries (each containing many data points) instead of over 
pairs of data points. This is implemented in the program that follows. 

Note that Kendall’s tau can be applied only to contingency tables where both 
variables are ordinal, i.e., well-ordered, and that it looks specifically for monotonic 
correlations, not for arbitrary associations. These two properties make it less general 
than the methods of § 14.4, which applied to nominal, i.e., unordered, variables and 
arbitrary associations. 

Comparing kendl1 above with kendl2 below, you will see that we have 
“floated” a number of variables. This is because the number of events in a 
contingency table might be sufficiently large as to cause overflows in some of the 
integer arithmetic, while the number of individual data points in a list could not 
possibly be that large [for an $O(N^2)$ routine!]. 


#include <math.h>

void kendl2(float **tab, int i, int j, float *tau, float *z, float *prob)
/* Given a two-dimensional table tab[1..i][1..j], such that tab[k][l] contains the number
of events falling in bin k of one variable and bin l of another, this program returns Kendall's tau
as tau, its number of standard deviations from zero as z, and its two-sided significance level as
prob. Small values of prob indicate a significant correlation (tau positive) or anticorrelation
(tau negative) between the two variables. Although tab is a float array, it will normally
contain integral values. */
{
    float erfcc(float x);
    long nn,mm,m2,m1,lj,li,l,kj,ki,k;
    float svar,s=0.0,points,pairs,en2=0.0,en1=0.0;

    nn=i*j;    /* Total number of entries in contingency table. */
    points=tab[i][j];
    for (k=0;k<=nn-2;k++) {    /* Loop over entries in table, */
        ki=(k/j);    /* decoding a row, */
        kj=k-j*ki;    /* and a column. */
        points += tab[ki+1][kj+1];    /* Increment the total count of events. */
        for (l=k+1;l<=nn-1;l++) {    /* Loop over other member of the pair, */
            li=l/j;    /* decoding its row */
            lj=l-j*li;    /* and column. */
            mm=(m1=li-ki)*(m2=lj-kj);
            pairs=tab[ki+1][kj+1]*tab[li+1][lj+1];
            if (mm) {    /* Not a tie. */
                en1 += pairs;
                en2 += pairs;
                s += (mm > 0 ? pairs : -pairs);    /* Concordant, or discordant. */
            } else {
                if (m1) en1 += pairs;
                if (m2) en2 += pairs;
            }
        }
    }
    *tau=s/sqrt(en1*en2);
    svar=(4.0*points+10.0)/(9.0*points*(points-1.0));
    *z=(*tau)/sqrt(svar);
    *prob=erfcc(fabs(*z)/1.4142136);
}


CITED REFERENCES AND FURTHER READING: 

Lehmann, E.L. 1975, Nonparametrics: Statistical Methods Based on Ranks (San Francisco: 
Holden-Day). 

Downie, N.M., and Heath, R.W. 1965, Basic Statistical Methods, 2nd ed. (New York: Harper & 
Row), pp. 206-209. 

Norusis, M.J. 1982, SPSS Introductory Guide: Basic Statistics and Operations; and 1985, SPSS-X 
Advanced Statistics Guide (New York: McGraw-Hill). 


14.7 Do Two-Dimensional Distributions Differ? 


We here discuss a useful generalization of the K-S test (§14.3) to two-dimensional 
distributions. This generalization is due to Fasano and Franceschini [1], a variant on an earlier 
idea due to Peacock [2]. 

In a two-dimensional distribution, each data point is characterized by an (x, y) pair of 
values. An example near to our hearts is that each of the 19 neutrinos that were detected 
from Supernova 1987A is characterized by a time $t_i$ and by an energy $E_i$ (see [3]). We 
might wish to know whether these measured pairs $(t_i, E_i)$, $i = 1 \ldots 19$ are consistent with a 
theoretical model that predicts neutrino flux as a function of both time and energy, that is, 
a two-dimensional probability distribution in the (x, y) [here, (t, E)] plane. That would be 
a one-sample test. Or, given two sets of neutrino detections, from two comparable detectors, 
we might want to know whether they are compatible with each other, a two-sample test. 

In the spirit of the tried-and-true, one-dimensional K-S test, we want to range over 
the (x, y) plane in search of some kind of maximum cumulative difference between two 
two-dimensional distributions. Unfortunately, cumulative probability distribution is not 
well-defined in more than one dimension! Peacock’s insight was that a good surrogate is 
the integrated probability in each of four natural quadrants around a given point $(x_i, y_i)$, 
namely the total probabilities (or fraction of data) in $(x > x_i, y > y_i)$, $(x < x_i, y > y_i)$, 
$(x < x_i, y < y_i)$, $(x > x_i, y < y_i)$. The two-dimensional K-S statistic D is now taken 
to be the maximum difference (ranging both over data points and over quadrants) of the 
corresponding integrated probabilities. When comparing two data sets, the value of D may 
depend on which data set is ranged over. In that case, define an effective D as the average 
of the two values obtained. If you are confused at this point about the exact definition of D, 
don’t fret; the accompanying computer routines amount to a precise algorithmic definition. 

Figure 14.7.1. Two-dimensional distributions of 65 triangles and 35 squares. The two-dimensional K-S 
test finds that point one of whose quadrants (shown by dotted lines) maximizes the difference between 
fraction of triangles and fraction of squares. Then, equation (14.7.1) indicates whether the difference is 
statistically significant, i.e., whether the triangles and squares must have different underlying distributions. 

Figure 14.7.1 gives a feeling for what is going on. The 65 triangles and 35 squares seem 
to have somewhat different distributions in the plane. The dotted lines are centered on the 
triangle that maximizes the D statistic; the maximum occurs in the upper-left quadrant. That 
quadrant contains only 0.12 of all the triangles, but it contains 0.56 of all the squares. The 
value of D is thus 0.44. Is this statistically significant? 

Even for fixed sample sizes, it is unfortunately not rigorously true that the distribution of 
D in the null hypothesis is independent of the shape of the two-dimensional distribution. In this 
respect the two-dimensional K-S test is not as natural as its one-dimensional parent. However, 
extensive Monte Carlo integrations have shown that the distribution of the two-dimensional 
D is very nearly identical for even quite different distributions, as long as they have the same 
coefficient of correlation r, defined in the usual way by equation (14.5.1). In their paper, 
Fasano and Franceschini tabulate Monte Carlo results for (what amounts to) the distribution of 
D as a function of (of course) D, sample size N, and coefficient of correlation r. Analyzing 
their results, one finds that the significance levels for the two-dimensional K-S test can be 
summarized by the simple, though approximate, formulas, 

$$\text{Probability}(D > \text{observed}) = Q_{KS}\left(\frac{\sqrt{N}\,D}{1 + \sqrt{1-r^2}\,(0.25 - 0.75/\sqrt{N})}\right) \tag{14.7.1}$$

for the one-sample case, and the same for the two-sample case, but with 

$$N = \frac{N_1 N_2}{N_1 + N_2} \tag{14.7.2}$$



The above formulas are accurate enough when $N \gtrsim 20$, and when the indicated 
probability (significance level) is less than (more significant than) 0.20 or so. When the 
indicated probability is > 0.20, its value may not be accurate, but the implication that the 
data and model (or two data sets) are not significantly different is certainly correct. Notice 
that in the limit of $r \to 1$ (perfect correlation), equations (14.7.1) and (14.7.2) reduce to 
equations (14.3.9) and (14.3.10): The two-dimensional data lie on a perfect straight line, and 
the two-dimensional K-S test becomes a one-dimensional K-S test. 

The significance level for the data in Figure 14.7.1, by the way, is about 0.001. This 
establishes to a near-certainty that the triangles and squares were drawn from different 
distributions. (As in fact they were.) 

Of course, if you do not want to rely on the Monte Carlo experiments embodied in 
equation (14.7.1), you can do your own: Generate a lot of synthetic data sets from your 
model, each one with the same number of points as the real data set. Compute D for each 
synthetic data set, using the accompanying computer routines (but ignoring their calculated 
probabilities), and count what fraction of the time these synthetic D’s exceed the D from the 
real data. That fraction is your significance. 
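
The final counting step of such a do-it-yourself Monte Carlo test is simple enough to sketch. The function name mc_significance is hypothetical, and the array d_synth is assumed to hold the D values already computed from the synthetic data sets; generating those data sets depends on your model and is not shown.

```c
#include <stddef.h>

/* Fraction of the m synthetic statistics d_synth[0..m-1] that exceed
   the observed statistic d_obs. Returning 1.0 when m is zero (no
   evidence against the null hypothesis) is an arbitrary convention
   adopted for this sketch. */
double mc_significance(double d_obs, const double d_synth[], size_t m)
{
    size_t exceed = 0, i;
    if (m == 0) return 1.0;
    for (i = 0; i < m; i++)
        if (d_synth[i] > d_obs) ++exceed;
    return (double)exceed / (double)m;
}
```

With synthetic values 0.1, 0.2, 0.3, 0.5 and an observed D of 0.25, two of the four exceed the observation, so the estimated significance is 0.5. In practice many more synthetic data sets are needed: the smallest nonzero significance that can be resolved is 1/m.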

One disadvantage of the two-dimensional tests, by comparison with their one-dimensional 
progenitors, is that the two-dimensional tests require of order $N^2$ operations: Two nested 
loops of order N take the place of an N log N sort. For small computers, this restricts the 
usefulness of the tests to N less than several thousand. 

We now give computer implementations. The one-sample case is embodied in the 
routine ks2d1s (that is, 2-dimensions, 1-sample). This routine calls a straightforward utility 
routine quadct to count points in the four quadrants, and it calls a user-supplied routine 
quadvl that must be capable of returning the integrated probability of an analytic model in 
each of four quadrants around an arbitrary (x, y) point. A trivial sample quadvl is shown; 
realistic quadvls can be quite complicated, often incorporating numerical quadratures over 
analytic two-dimensional distributions. 


#include <math.h>
#include "nrutil.h"

void ks2d1s(float x1[], float y1[], unsigned long n1,
    void (*quadvl)(float, float, float *, float *, float *, float *),
    float *d1, float *prob)
/* Two-dimensional Kolmogorov-Smirnov test of one sample against a model. Given the x and y
coordinates of n1 data points in arrays x1[1..n1] and y1[1..n1], and given a user-supplied
function quadvl that exemplifies the model, this routine returns the two-dimensional K-S
statistic as d1, and its significance level as prob. Small values of prob show that the sample
is significantly different from the model. Note that the test is slightly distribution-dependent,
so prob is only an estimate. */
{
    void pearsn(float x[], float y[], unsigned long n, float *r, float *prob,
        float *z);
    float probks(float alam);
    void quadct(float x, float y, float xx[], float yy[], unsigned long nn,
        float *fa, float *fb, float *fc, float *fd);
    unsigned long j;
    float dum,dumm,fa,fb,fc,fd,ga,gb,gc,gd,r1,rr,sqen;

    *d1=0.0;
    for (j=1;j<=n1;j++) {    /* Loop over the data points. */
        quadct(x1[j],y1[j],x1,y1,n1,&fa,&fb,&fc,&fd);
        (*quadvl)(x1[j],y1[j],&ga,&gb,&gc,&gd);
        *d1=FMAX(*d1,fabs(fa-ga));
        *d1=FMAX(*d1,fabs(fb-gb));
        *d1=FMAX(*d1,fabs(fc-gc));
        *d1=FMAX(*d1,fabs(fd-gd));
        /* For both the sample and the model, the distribution is integrated in each of four
        quadrants, and the maximum difference is saved. */
    }
    pearsn(x1,y1,n1,&r1,&dum,&dumm);    /* Get the linear correlation coefficient r1. */
    sqen=sqrt((double)n1);
    rr=sqrt(1.0-r1*r1);
    /* Estimate the probability using the K-S probability function probks. */
    *prob=probks(*d1*sqen/(1.0+rr*(0.25-0.75/sqen)));
}


void quadct(float x, float y, float xx[], float yy[], unsigned long nn,
    float *fa, float *fb, float *fc, float *fd)
/* Given an origin (x,y), and an array of nn points with coordinates xx[1..nn] and yy[1..nn],
count how many of them are in each quadrant around the origin, and return the normalized
fractions. Quadrants are labeled alphabetically, counterclockwise from the upper right. Used
by ks2d1s and ks2d2s. */
{
    unsigned long k,na,nb,nc,nd;
    float ff;

    na=nb=nc=nd=0;
    for (k=1;k<=nn;k++) {
        if (yy[k] > y) {
            xx[k] > x ? ++na : ++nb;
        } else {
            xx[k] > x ? ++nd : ++nc;
        }
    }
    ff=1.0/nn;
    *fa=ff*na;
    *fb=ff*nb;
    *fc=ff*nc;
    *fd=ff*nd;
}


#include "nrutil.h"

void quadvl(float x, float y, float *fa, float *fb, float *fc, float *fd)
/* This is a sample of a user-supplied routine to be used with ks2d1s. In this case, the model
distribution is uniform inside the square -1 < x < 1, -1 < y < 1. In general this routine
should return, for any point (x,y), the fraction of the total distribution in each of the four
quadrants around that point. The fractions, fa, fb, fc, and fd, must add up to 1. Quadrants
are alphabetical, counterclockwise from the upper right. */
{
    float qa,qb,qc,qd;

    qa=FMIN(2.0,FMAX(0.0,1.0-x));
    qb=FMIN(2.0,FMAX(0.0,1.0-y));
    qc=FMIN(2.0,FMAX(0.0,x+1.0));
    qd=FMIN(2.0,FMAX(0.0,y+1.0));
    *fa=0.25*qa*qb;
    *fb=0.25*qb*qc;
    *fc=0.25*qc*qd;
    *fd=0.25*qd*qa;
}



The routine ks2d2s is the two-sample case of the two-dimensional K-S test. It also calls 
quadct, pearsn, and probks. Being a two-sample test, it does not need an analytic model. 

#include <math.h>
#include "nrutil.h"

void ks2d2s(float x1[], float y1[], unsigned long n1, float x2[], float y2[],
    unsigned long n2, float *d, float *prob)
/* Two-dimensional Kolmogorov-Smirnov test on two samples. Given the x and y coordinates of
the first sample as n1 values in arrays x1[1..n1] and y1[1..n1], and likewise for the second
sample, n2 values in arrays x2 and y2, this routine returns the two-dimensional, two-sample
K-S statistic as d, and its significance level as prob. Small values of prob show that the
two samples are significantly different. Note that the test is slightly distribution-dependent, so
prob is only an estimate. */
{
    void pearsn(float x[], float y[], unsigned long n, float *r, float *prob,
        float *z);
    float probks(float alam);
    void quadct(float x, float y, float xx[], float yy[], unsigned long nn,
        float *fa, float *fb, float *fc, float *fd);
    unsigned long j;
    float d1,d2,dum,dumm,fa,fb,fc,fd,ga,gb,gc,gd,r1,r2,rr,sqen;

    d1=0.0;
    for (j=1;j<=n1;j++) {    /* First, use points in the first sample as origins. */
        quadct(x1[j],y1[j],x1,y1,n1,&fa,&fb,&fc,&fd);
        quadct(x1[j],y1[j],x2,y2,n2,&ga,&gb,&gc,&gd);
        d1=FMAX(d1,fabs(fa-ga));
        d1=FMAX(d1,fabs(fb-gb));
        d1=FMAX(d1,fabs(fc-gc));
        d1=FMAX(d1,fabs(fd-gd));
    }
    d2=0.0;
    for (j=1;j<=n2;j++) {    /* Then, use points in the second sample as origins. */
        quadct(x2[j],y2[j],x1,y1,n1,&fa,&fb,&fc,&fd);
        quadct(x2[j],y2[j],x2,y2,n2,&ga,&gb,&gc,&gd);
        d2=FMAX(d2,fabs(fa-ga));
        d2=FMAX(d2,fabs(fb-gb));
        d2=FMAX(d2,fabs(fc-gc));
        d2=FMAX(d2,fabs(fd-gd));
    }
    *d=0.5*(d1+d2);    /* Average the K-S statistics. */
    sqen=sqrt(n1*n2/(double)(n1+n2));
    pearsn(x1,y1,n1,&r1,&dum,&dumm);    /* Get the linear correlation coefficient for each sample. */
    pearsn(x2,y2,n2,&r2,&dum,&dumm);
    rr=sqrt(1.0-0.5*(r1*r1+r2*r2));
    /* Estimate the probability using the K-S probability function probks. */
    *prob=probks(*d*sqen/(1.0+rr*(0.25-0.75/sqen)));
}



CITED REFERENCES AND FURTHER READING: 

Fasano, G. and Franceschini, A. 1987, Monthly Notices of the Royal Astronomical Society, 
vol. 225, pp. 155-170. [1] 

Peacock, J.A. 1983, Monthly Notices of the Royal Astronomical Society, vol. 202, pp. 615-627. [2]

Spergel, D.N., Piran, T., Loeb, A., Goodman, J., and Bahcall, J.N. 1987, Science, vol. 237,
pp. 1471-1473. [3]


14.8 Savitzky-Golay Smoothing Filters


In §13.5 we learned something about the construction and application of digital filters, 
but little guidance was given on which particular filter to use. That, of course, depends 
on what you want to accomplish by filtering. One obvious use for low-pass filters is to 
smooth noisy data. 

The premise of data smoothing is that one is measuring a variable that is both slowly 
varying and also corrupted by random noise. Then it can sometimes be useful to replace 
each data point by some kind of local average of surrounding data points. Since nearby 
points measure very nearly the same underlying value, averaging can reduce the level of noise 
without (much) biasing the value obtained. 

We must comment editorially that the smoothing of data lies in a murky area, beyond 
the fringe of some better posed, and therefore more highly recommended, techniques that are 
discussed elsewhere in this book. If you are fitting data to a parametric model, for example 
(see Chapter 15), it is almost always better to use raw data than to use data that has been 
pre-processed by a smoothing procedure. Another alternative to blind smoothing is so-called 
“optimal” or Wiener filtering, as discussed in §13.3 and more generally in §13.6. Data 
smoothing is probably most justified when it is used simply as a graphical technique, to guide 
the eye through a forest of data points all with large error bars; or as a means of making initial 
rough estimates of simple parameters from a graph. 

In this section we discuss a particular type of low-pass filter, well-adapted for data 
smoothing, and termed variously Savitzky-Golay [1], least-squares [2], or DISPO (Digital
Smoothing Polynomial) [3] filters. Rather than having their properties defined in the Fourier 
domain, and then translated to the time domain, Savitzky-Golay filters derive directly from 
a particular formulation of the data smoothing problem in the time domain, as we will now 
see. Savitzky-Golay filters were initially (and are still often) used to render visible the relative 
widths and heights of spectral lines in noisy spectrometric data. 

Recall that a digital filter is applied to a series of equally spaced data values $f_i \equiv f(t_i)$,
where $t_i \equiv t_0 + i\Delta$ for some constant sample spacing $\Delta$ and $i = \ldots, -2, -1, 0, 1, 2, \ldots$.
We have seen (§13.5) that the simplest type of digital filter (the nonrecursive or finite impulse
response filter) replaces each data value $f_i$ by a linear combination $g_i$ of itself and some
number of nearby neighbors,

$$g_i = \sum_{n=-n_L}^{n_R} c_n f_{i+n} \qquad (14.8.1)$$

Here $n_L$ is the number of points used “to the left” of a data point $i$, i.e., earlier than it, while
$n_R$ is the number used to the right, i.e., later. A so-called causal filter would have $n_R = 0$.

As a starting point for understanding Savitzky-Golay filters, consider the simplest
possible averaging procedure: For some fixed $n_L = n_R$, compute each $g_i$ as the average of
the data points from $f_{i-n_L}$ to $f_{i+n_R}$. This is sometimes called moving window averaging
and corresponds to equation (14.8.1) with constant $c_n = 1/(n_L + n_R + 1)$. If the underlying
function is constant, or is changing linearly with time (increasing or decreasing), then no
bias is introduced into the result. Higher points at one end of the averaging interval are on
the average balanced by lower points at the other end. A bias is introduced, however, if
the underlying function has a nonzero second derivative. At a local maximum, for example,
moving window averaging always reduces the function value. In the spectrometric application,
a narrow spectral line has its height reduced and its width increased. Since these parameters
are themselves of physical interest, the bias introduced is distinctly undesirable.

Note, however, that moving window averaging does preserve the area under a spectral
line, which is its zeroth moment, and also (if the window is symmetric with $n_L = n_R$) its
mean position in time, which is its first moment. What is violated is the second moment,
equivalent to the line width.
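
The bias described above is easy to check numerically. The following is a minimal sketch, not one of the book's routines: the name moving_average and the zero-based arrays are assumptions made for illustration. It evaluates one output of equation (14.8.1) with constant coefficients.

```c
#include <stddef.h>

/* Moving window average of f at index i, using nl points to the left and
   nr points to the right: equation (14.8.1) with c_n = 1/(nl+nr+1).
   Hypothetical helper with zero-based indexing; the caller must ensure
   that i >= nl and that i+nr is a valid index of f. */
double moving_average(const double f[], size_t i, size_t nl, size_t nr)
{
    double sum = 0.0;
    size_t k;

    for (k = i - nl; k <= i + nr; k++) sum += f[k];
    return sum / (double)(nl + nr + 1);
}
```

Applied with nl = nr = 2 at the center of a linear ramp, the average reproduces the central value. Applied at the peak of the parabola 9 - (k-3)^2 sampled at k = 1,...,5 (values 5, 8, 9, 8, 5), it returns 7 rather than the peak value 9, which is the height reduction at a local maximum.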

The idea of Savitzky-Golay filtering is to find filter coefficients $c_n$ that preserve higher
moments. Equivalently, the idea is to approximate the underlying function within the moving
window not by a constant (whose estimate is the average), but by a polynomial of higher
order, typically quadratic or quartic: For each point $f_i$, we least-squares fit a polynomial to all
$n_L + n_R + 1$ points in the moving window, and then set $g_i$ to be the value of that polynomial
at position $i$. (If you are not familiar with least-squares fitting, you might want to look ahead
to Chapter 15.) We make no use of the value of the polynomial at any other point. When we
move on to the next point $f_{i+1}$, we do a whole new least-squares fit using a shifted window.

Sample Savitzky-Golay Coefficients

 M  nL  nR    n=-5      -4      -3      -2      -1       0      +1      +2      +3      +4      +5
 2   2   2                          -0.086   0.343   0.486   0.343  -0.086
 2   3   1                  -0.143   0.171   0.343   0.371   0.257
 2   4   0           0.086  -0.143  -0.086   0.257   0.886
 2   5   5  -0.084   0.021   0.103   0.161   0.196   0.207   0.196   0.161   0.103   0.021  -0.084
 4   4   4           0.035  -0.128   0.070   0.315   0.417   0.315   0.070  -0.128   0.035
 4   5   5   0.042  -0.105  -0.023   0.140   0.280   0.333   0.280   0.140  -0.023  -0.105   0.042


All these least-squares fits would be laborious if done as described. Luckily, since the 
process of least-squares fitting involves only a linear matrix inversion, the coefficients of a 
fitted polynomial are themselves linear in the values of the data. That means that we can do 
all the fitting in advance, for fictitious data consisting of all zeros except for a single 1, and 
then do the fits on the real data just by taking linear combinations. This is the key point, then: 
There are particular sets of filter coefficients $c_n$ for which equation (14.8.1) “automatically”
accomplishes the process of polynomial least-squares fitting inside a moving window. 
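
To make the key point concrete, here is a minimal sketch that applies one fixed set of coefficients as an ordinary convolution. It is not one of the book's routines: the name sg5_quadratic and the zero-based indexing are assumptions. The weights -3/35, 12/35, 17/35 are the exact values that round to the tabulated coefficients -0.086, 0.343, 0.486 for M = 2, nL = nR = 2.

```c
/* Five-point Savitzky-Golay smoothing (M = 2, nL = nR = 2) at index i.
   Evaluates equation (14.8.1) with the fixed weights (-3, 12, 17, 12, -3)/35.
   Hypothetical helper with zero-based indexing; the caller must ensure
   that i-2 and i+2 are valid indices of f. */
double sg5_quadratic(const double f[], int i)
{
    static const double c[5] = { -3.0/35.0, 12.0/35.0, 17.0/35.0,
                                 12.0/35.0, -3.0/35.0 };
    double g = 0.0;
    int n;

    for (n = -2; n <= 2; n++) g += c[n + 2] * f[i + n];
    return g;
}
```

No fit is performed at run time, yet data lying exactly on a parabola are returned unchanged: for the samples 5, 8, 9, 8, 5 the weighted sum is 315/35 = 9, the true central value, where a plain five-point average would give 7.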

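To derive such coefficients, consider how $g_0$ might be obtained: We want to fit a
polynomial of degree $M$ in $i$, namely $a_0 + a_1 i + \cdots + a_M i^M$, to the values $f_{-n_L}, \ldots, f_{n_R}$.
Then $g_0$ will be the value of that polynomial at $i = 0$, namely $a_0$. The design matrix for
this problem (§15.4) is

$$A_{ij} = i^j \qquad i = -n_L, \ldots, n_R, \quad j = 0, \ldots, M \qquad (14.8.2)$$

and the normal equations for the vector of $a_j$'s in terms of the vector of $f_i$'s is in matrix notation

$$(\mathbf{A}^T \cdot \mathbf{A}) \cdot \mathbf{a} = \mathbf{A}^T \cdot \mathbf{f} \qquad \text{or} \qquad \mathbf{a} = (\mathbf{A}^T \cdot \mathbf{A})^{-1} \cdot (\mathbf{A}^T \cdot \mathbf{f}) \qquad (14.8.3)$$

We also have the specific forms

$$\{\mathbf{A}^T \cdot \mathbf{A}\}_{ij} = \sum_{k=-n_L}^{n_R} A_{ki} A_{kj} = \sum_{k=-n_L}^{n_R} k^{i+j} \qquad (14.8.4)$$

and

$$\{\mathbf{A}^T \cdot \mathbf{f}\}_j = \sum_{k=-n_L}^{n_R} A_{kj} f_k = \sum_{k=-n_L}^{n_R} k^j f_k \qquad (14.8.5)$$

Since the coefficient $c_n$ is the component $a_0$ when $\mathbf{f}$ is replaced by the unit vector $\mathbf{e}_n$,
$-n_L \le n \le n_R$, we have

$$c_n = \{(\mathbf{A}^T \cdot \mathbf{A})^{-1} \cdot (\mathbf{A}^T \cdot \mathbf{e}_n)\}_0 = \sum_{m=0}^{M} \{(\mathbf{A}^T \cdot \mathbf{A})^{-1}\}_{0m} \, n^m \qquad (14.8.6)$$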

Note that equation (14.8.6) says that we need only one row of the inverse matrix. (Numerically 
we can get this by LU decomposition with only a single backsubstitution.) 
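
Equation (14.8.6) can be exercised without the rest of the library. The sketch below is self-contained and differs from the book's approach in several labeled ways: it uses plain Gaussian elimination with partial pivoting in place of LU decomposition, zero-based arrays, natural (not wrap-around) ordering of the output, and a fixed maximum degree. The names sg_solve, sg_coeffs, and SG_MAXM are assumptions for illustration only.

```c
#include <math.h>

#define SG_MAXM 6               /* largest polynomial degree handled here */
#define SG_DIM (SG_MAXM + 1)

/* Solve a.x = b for n unknowns by Gaussian elimination with partial
   pivoting. The solution overwrites b. Returns 0, or -1 if singular. */
static int sg_solve(double a[SG_DIM][SG_DIM], double b[SG_DIM], int n)
{
    int i, j, k, p;
    double t;

    for (k = 0; k < n; k++) {
        p = k;                              /* find the pivot row */
        for (i = k + 1; i < n; i++)
            if (fabs(a[i][k]) > fabs(a[p][k])) p = i;
        if (a[p][k] == 0.0) return -1;
        if (p != k) {                       /* swap rows k and p */
            for (j = 0; j < n; j++) { t = a[k][j]; a[k][j] = a[p][j]; a[p][j] = t; }
            t = b[k]; b[k] = b[p]; b[p] = t;
        }
        for (i = k + 1; i < n; i++) {       /* eliminate below the pivot */
            t = a[i][k] / a[k][k];
            for (j = k; j < n; j++) a[i][j] -= t * a[k][j];
            b[i] -= t * b[k];
        }
    }
    for (i = n - 1; i >= 0; i--) {          /* back substitution */
        t = b[i];
        for (j = i + 1; j < n; j++) t -= a[i][j] * b[j];
        b[i] = t / a[i][i];
    }
    return 0;
}

/* Smoothing coefficients c_n of equation (14.8.6), stored in natural order
   as c[n + nl] for n = -nl..nr. Returns 0 on success, -1 on bad arguments. */
int sg_coeffs(double c[], int nl, int nr, int m)
{
    double a[SG_DIM][SG_DIM], b[SG_DIM], s, p;
    int i, j, k, e, n;

    if (m < 0 || m > SG_MAXM || nl < 0 || nr < 0 || nl + nr < m) return -1;
    for (i = 0; i <= m; i++)                /* equation (14.8.4) */
        for (j = 0; j <= m; j++) {
            s = 0.0;
            for (k = -nl; k <= nr; k++) {
                p = 1.0;
                for (e = 0; e < i + j; e++) p *= k;   /* k^(i+j), with 0^0 = 1 */
                s += p;
            }
            a[i][j] = s;
        }
    for (j = 0; j <= m; j++) b[j] = 0.0;
    b[0] = 1.0;                             /* unit vector selects row 0 of the inverse */
    if (sg_solve(a, b, m + 1) != 0) return -1;
    for (n = -nl; n <= nr; n++) {           /* c_n = sum over m of b[m] * n^m */
        s = 0.0;
        p = 1.0;
        for (j = 0; j <= m; j++) { s += b[j] * p; p *= n; }
        c[n + nl] = s;
    }
    return 0;
}
```

Because the matrix in (14.8.4) is symmetric, solving against the unit vector yields exactly the one row of the inverse that (14.8.6) needs. With nl = nr = 2 and m = 2 the routine reproduces the weights (-3, 12, 17, 12, -3)/35, which round to the first row of the sample table.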

The function savgol, below, implements equation (14.8.6). As input, it takes the
parameters nl = $n_L$, nr = $n_R$, and m = $M$ (the desired order). Also input is np, the
physical length of the output array c, and a parameter ld which for data fitting should be
zero. In fact, ld specifies which coefficient among the $a_i$'s should be returned, and we are
here interested in $a_0$. For another purpose, namely the computation of numerical derivatives
(already mentioned in §5.7) the useful choice is ld ≥ 1. With ld = 1, for example, the
filtered first derivative is the convolution (14.8.1) divided by the stepsize $\Delta$. For ld = k > 1,
the array c must be multiplied by k! to give derivative coefficients. For derivatives, one
usually wants m = 4 or larger.

#include <math.h>
#include "nrutil.h"

void savgol(float c[], int np, int nl, int nr, int ld, int m)

Returns in c[1..np], in wrap-around order (N.B.!) consistent with the argument respns in
routine convlv, a set of Savitzky-Golay filter coefficients. nl is the number of leftward (past)
data points used, while nr is the number of rightward (future) data points, making the total
number of data points used nl + nr + 1. ld is the order of the derivative desired (e.g., ld = 0
for smoothed function). m is the order of the smoothing polynomial, also equal to the highest
conserved moment; usual values are m = 2 or m = 4.

{
    void lubksb(float **a, int n, int *indx, float b[]);
    void ludcmp(float **a, int n, int *indx, float *d);
    int imj,ipj,j,k,kk,mm,*indx;
    float d,fac,sum,**a,*b;

    if (np < nl+nr+1 || nl < 0 || nr < 0 || ld > m || nl+nr < m)
        nrerror("bad args in savgol");
    indx=ivector(1,m+1);
    a=matrix(1,m+1,1,m+1);
    b=vector(1,m+1);
    for (ipj=0;ipj<=(m << 1);ipj++) {    /* Set up the normal equations of the desired */
        sum=(ipj ? 0.0 : 1.0);           /* least-squares fit. */
        for (k=1;k<=nr;k++) sum += pow((double)k,(double)ipj);
        for (k=1;k<=nl;k++) sum += pow((double)-k,(double)ipj);
        mm=IMIN(ipj,2*m-ipj);
        for (imj = -mm;imj<=mm;imj+=2) a[1+(ipj+imj)/2][1+(ipj-imj)/2]=sum;
    }
    ludcmp(a,m+1,indx,&d);               /* Solve them: LU decomposition. */
    for (j=1;j<=m+1;j++) b[j]=0.0;
    b[ld+1]=1.0;
    /* Right-hand side vector is unit vector, depending on which derivative we want. */
    lubksb(a,m+1,indx,b);                /* Get one row of the inverse matrix. */
    for (kk=1;kk<=np;kk++) c[kk]=0.0;    /* Zero the output array (it may be bigger than */
    for (k = -nl;k<=nr;k++) {            /* number of coefficients). */
        sum=b[1];                        /* Each Savitzky-Golay coefficient is the dot */
        fac=1.0;                         /* product of powers of an integer with the */
        for (mm=1;mm<=m;mm++) sum += b[mm+1]*(fac *= k);    /* inverse matrix row. */
        kk=((np-k) % np)+1;              /* Store in wrap-around order. */
        c[kk]=sum;
    }
    free_vector(b,1,m+1);
    free_matrix(a,1,m+1,1,m+1);
    free_ivector(indx,1,m+1);
}


As output, savgol returns the coefficients $c_n$, for $-n_L \le n \le n_R$. These are stored in
c in “wrap-around order”; that is, $c_0$ is in c[1], $c_{-1}$ is in c[2], and so on for further negative
indices. The value $c_1$ is stored in c[np], $c_2$ in c[np-1], and so on for positive indices. This
order may seem arcane, but it is the natural one where causal filters have nonzero coefficients
in low array elements of c. It is also the order required by the function convlv in §13.1,
which can be used to apply the digital filter to a data set.
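
The index arithmetic behind wrap-around order is compact enough to isolate. This small sketch copies the expression used inside savgol; the function name wrap_slot is an assumption introduced for illustration.

```c
/* Slot (1-based, as in savgol) that holds coefficient c_n in a wrap-around
   array of physical length np. Valid for -np < n < np.
   Hypothetical helper mirroring kk=((np-k) % np)+1 in savgol. */
int wrap_slot(int n, int np)
{
    return ((np - n) % np) + 1;
}
```

With np = 8, for example, n = 0, -1, -2 map to slots 1, 2, 3, while n = 1, 2 map to slots 8, 7.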

The accompanying table shows some typical output from savgol. For orders 2 and
4, the coefficients of Savitzky-Golay filters with several choices of $n_L$ and $n_R$ are shown.
The central column is the coefficient applied to the data $f_i$ in obtaining the smoothed $g_i$.
Coefficients to the left are applied to earlier data; to the right, to later. The coefficients
always add (within roundoff error) to unity. One sees that, as befits a smoothing operator,
the coefficients always have a central positive lobe, but with smaller, outlying corrections of
both positive and negative sign. In practice, the Savitzky-Golay filters are most useful for
much larger values of $n_L$ and $n_R$, since these few-point formulas can accomplish only a
relatively small amount of smoothing.

Figure 14.8.1 shows a numerical experiment using a 33 point smoothing filter, that is,
$n_L = n_R = 16$. The upper panel shows a test function, constructed to have six “bumps” of
varying widths, all of height 8 units. To this function Gaussian white noise of unit variance
has been added. (The test function without noise is shown as the dotted curves in the center
and lower panels.) The widths of the bumps (full width at half of maximum, or FWHM) are
140, 43, 24, 17, 13, and 10, respectively.

Figure 14.8.1. Top: Synthetic noisy data consisting of a sequence of progressively narrower bumps,
and additive Gaussian white noise. Center: Result of smoothing the data by a simple moving window
average. The window extends 16 points leftward and rightward, for a total of 33 points. Note that narrow
features are broadened and suffer corresponding loss of amplitude. The dotted curve is the underlying
function used to generate the synthetic data. Bottom: Result of smoothing the data by a Savitzky-Golay
smoothing filter (of degree 4) using the same 33 points. While there is less smoothing of the broadest
feature, narrower features have their heights and widths preserved.

The middle panel of Figure 14.8.1 shows the result of smoothing by a moving window 
average. One sees that the window of width 33 does quite a nice job of smoothing the broadest 
bump, but that the narrower bumps suffer considerable loss of height and increase of width. 
The underlying signal (dotted) is very badly represented. 

The lower panel shows the result of smoothing with a Savitzky-Golay filter of the 
identical width, and degree M = 4. One sees that the heights and widths of the bumps are 
quite extraordinarily preserved. A trade-off is that the broadest bump is less smoothed. That 
is because the central positive lobe of the Savitzky-Golay filter coefficients fills only a fraction 
of the full 33 point width. As a rough guideline, best results are obtained when the full width 
of the degree 4 Savitzky-Golay filter is between 1 and 2 times the FWHM of desired features 
in the data. (References [3] and [4] give additional practical hints.) 

Figure 14.8.2 shows the result of smoothing the same noisy “data” with broader 
Savitzky-Golay filters of 3 different orders. Here we have $n_L = n_R = 32$ (65 point filter)
and $M = 2, 4, 6$. One sees that, when the bumps are too narrow with respect to the filter
size, then even the Savitzky-Golay filter must at some point give out. The higher order filter 
manages to track narrower features, but at the cost of less smoothing on broad features. 

To summarize: Within limits, Savitzky-Golay filtering does manage to provide smoothing
without loss of resolution. It does this by assuming that relatively distant data points have
some significant redundancy that can be used to reduce the level of noise. The specific nature
of the assumed redundancy is that the underlying function should be locally well-fitted by a
polynomial. When this is true, as it is for smooth line profiles not too much narrower than
the filter width, then the performance of Savitzky-Golay filters can be spectacular. When it
is not true, then these filters have no compelling advantage over other classes of smoothing
filter coefficients.

Figure 14.8.2. Result of applying wider 65 point Savitzky-Golay filters to the same data set as in Figure
14.8.1. Top: degree 2. Center: degree 4. Bottom: degree 6. All of these filters are inoptimally broad
for the resolution of the narrow features. Higher-order filters do best at preserving feature heights and
widths, but do less smoothing on broader features.

A last remark concerns irregularly sampled data, where the values fi are not uniformly 
spaced in time. The obvious generalization of Savitzky-Golay filtering would be to do a 
least-squares fit within a moving window around each data point, one containing a fixed 
number of data points to the left ($n_L$) and right ($n_R$). Because of the irregular spacing,
however, there is no way to obtain universal filter coefficients applicable to more than one
data point. One must instead do the actual least-squares fits for each data point. This becomes
computationally burdensome for larger $n_L$, $n_R$, and $M$.

As a cheap alternative, one can simply pretend that the data points are equally spaced. 
This amounts to virtually shifting, within each moving window, the data points to equally 
spaced positions. Such a shift introduces the equivalent of an additional source of noise 
into the function values. In those cases where smoothing is useful, this noise will often be 
much smaller than the noise already present. Specifically, if the location of the points is
approximately random within the window, then a rough criterion is this: If the change in $f$
across the full width of the $N = n_L + n_R + 1$ point window is less than $\sqrt{N/2}$ times the
measurement noise on a single point, then the cheap method can be used.


Chapter 15. Modeling of Data


15.0 Introduction 

Given a set of observations, one often wants to condense and summarize the 
data by fitting it to a “model” that depends on adjustable parameters. Sometimes the 
model is simply a convenient class of functions, such as polynomials or Gaussians, 
and the fit supplies the appropriate coefficients. Other times, the model’s parameters 
come from some underlying theory that the data are supposed to satisfy; examples 
are coefficients of rate equations in a complex network of chemical reactions, or 
orbital elements of a binary star. Modeling can also be used as a kind of constrained 
interpolation, where you want to extend a few data points into a continuous function, 
but with some underlying idea of what that function should look like. 

The basic approach in all cases is usually the same: You choose or design a 
figure-of-merit function (“merit function,” for short) that measures the agreement 
between the data and the model with a particular choice of parameters. The merit 
function is conventionally arranged so that small values represent close agreement. 
The parameters of the model are then adjusted to achieve a minimum in the merit 
function, yielding best-fit parameters. The adjustment process is thus a problem in 
minimization in many dimensions. This optimization was the subject of Chapter 10; 
however, there exist special, more efficient, methods that are specific to modeling, 
and we will discuss these in this chapter. 

There are important issues that go beyond the mere finding of best-fit parameters. 
Data are generally not exact. They are subject to measurement errors (called noise 
in the context of signal-processing). Thus, typical data never exactly fit the model 
that is being used, even when that model is correct. We need the means to assess 
whether or not the model is appropriate, that is, we need to test the goodness-of-fit 
against some useful statistical standard. 

We usually also need to know the accuracy with which parameters are determined 
by the data set. In other words, we need to know the likely errors of the best-fit 
parameters. 

Finally, it is not uncommon in fitting data to discover that the merit function 
is not unimodal, with a single minimum. In some cases, we may be interested in 
global rather than local questions. Not, “how good is this fit?” but rather, “how sure 
am I that there is not a very much better fit in some corner of parameter space?” 
As we have seen in Chapter 10, especially §10.9, this kind of problem is generally 
quite difficult to solve. 

The important message we want to deliver is that fitting of parameters is not 
the end-all of parameter estimation. To be genuinely useful, a fitting procedure 
should provide (i) parameters, (ii) error estimates on the parameters, and (iii) a
statistical measure of goodness-of-fit. When the third item suggests that the model 
is an unlikely match to the data, then items (i) and (ii) are probably worthless. 
Unfortunately, many practitioners of parameter estimation never proceed beyond 
item (i). They deem a fit acceptable if a graph of data and model “looks good.” This 
approach is known as chi-by-eye. Luckily, its practitioners get what they deserve. 


CITED REFERENCES AND FURTHER READING: 

Bevington, P.R. 1969, Data Reduction and Error Analysis for the Physical Sciences (New York:
McGraw-Hill).

Brownlee, K.A. 1965, Statistical Theory and Methodology, 2nd ed. (New York: Wiley).

Martin, B.R. 1971, Statistics for Physicists (New York: Academic Press).

von Mises, R. 1964, Mathematical Theory of Probability and Statistics (New York: Academic
Press), Chapter X.

Korn, G.A., and Korn, T.M. 1968, Mathematical Handbook for Scientists and Engineers, 2nd ed. 
(New York: McGraw-Hill), Chapters 18-19. 


15.1 Least Squares as a Maximum Likelihood Estimator


Suppose that we are fitting $N$ data points $(x_i, y_i)$, $i = 1, \ldots, N$, to a model that
has $M$ adjustable parameters $a_j$, $j = 1, \ldots, M$. The model predicts a functional
relationship between the measured independent and dependent variables,

$$y(x) = y(x; a_1 \ldots a_M) \qquad (15.1.1)$$

where the dependence on the parameters is indicated explicitly on the right-hand side.

What, exactly, do we want to minimize to get fitted values for the $a_j$'s? The
first thing that comes to mind is the familiar least-squares fit,

$$\text{minimize over } a_1 \ldots a_M: \qquad \sum_{i=1}^{N} \left[ y_i - y(x_i; a_1 \ldots a_M) \right]^2 \qquad (15.1.2)$$

But where does this come from? What general principles is it based on? The answer 
to these questions takes us into the subject of maximum likelihood estimators. 

Given a particular data set of $x_i$'s and $y_i$'s, we have the intuitive feeling that
some parameter sets $a_1 \ldots a_M$ are very unlikely (those for which the model
function $y(x)$ looks nothing like the data) while others may be very likely (those
that closely resemble the data). How can we quantify this intuitive feeling? How can
we select fitted parameters that are “most likely” to be correct? It is not meaningful
to ask the question, “What is the probability that a particular set of fitted parameters
$a_1 \ldots a_M$ is correct?” The reason is that there is no statistical universe of models
from which the parameters are drawn. There is just one model, the correct one, and
a statistical universe of data sets that are drawn from it!

That being the case, we can, however, turn the question around, and ask, “Given
a particular set of parameters, what is the probability that this data set could have
occurred?” If the $y_i$'s take on continuous values, the probability will always be
zero unless we add the phrase, “...plus or minus some fixed $\Delta y$ on each data point.”
So let’s always take this phrase as understood. If the probability of obtaining the 
data set is infinitesimally small, then we can conclude that the parameters under 
consideration are “unlikely” to be right. Conversely, our intuition tells us that the 
data set should not be too improbable for the correct choice of parameters. 

In other words, we identify the probability of the data given the parameters 
(which is a mathematically computable number), as the likelihood of the parameters 
given the data. This identification is entirely based on intuition. It has no formal 
mathematical basis in and of itself; as we already remarked, statistics is not a 
branch of mathematics! 

Once we make this intuitive identification, however, it is only a small further 
step to decide to fit for the parameters $a_1 \ldots a_M$ precisely by finding those values
that maximize the likelihood defined in the above way. This form of parameter 
estimation is maximum likelihood estimation. 

We are now ready to make the connection to (15.1.2). Suppose that each data
point $y_i$ has a measurement error that is independently random and distributed as a
normal (Gaussian) distribution around the “true” model $y(x)$. And suppose that the
standard deviations $\sigma$ of these normal distributions are the same for all points. Then
the probability of the data set is the product of the probabilities of each point,

$$P \propto \prod_{i=1}^{N} \left\{ \exp\left[ -\frac{1}{2} \left( \frac{y_i - y(x_i)}{\sigma} \right)^2 \right] \Delta y \right\} \qquad (15.1.3)$$


Notice that there is a factor $\Delta y$ in each term in the product. Maximizing (15.1.3) is
equivalent to maximizing its logarithm, or minimizing the negative of its logarithm,
namely,

$$\left[ \sum_{i=1}^{N} \frac{[y_i - y(x_i)]^2}{2\sigma^2} \right] - N \log \Delta y \qquad (15.1.4)$$

Since $N$, $\sigma$, and $\Delta y$ are all constants, minimizing this equation is equivalent to
minimizing (15.1.2).

What we see is that least-squares fitting is a maximum likelihood estimation 
of the fitted parameters if the measurement errors are independent and normally 
distributed with constant standard deviation. Notice that we made no assumption 
about the linearity or nonlinearity of the model $y(x; a_1 \ldots)$ in its parameters
$a_1 \ldots a_M$. Just below, we will relax our assumption of constant standard deviations
and obtain the very similar formulas for what is called “chi-square fitting” or 
“weighted least-squares fitting.” First, however, let us discuss further our very 
stringent assumption of a normal distribution. 

For a hundred years or so, mathematical statisticians have been in love with 
the fact that the probability distribution of the sum of a very large number of very 
small random deviations almost always converges to a normal distribution. (For 
precise statements of this central limit theorem, consult [1] or other standard works
on mathematical statistics.) This infatuation tended to focus interest away from the
fact that, for real data, the normal distribution is often rather poorly realized, if it is
realized at all. We are often taught, rather casually, that, on average, measurements
will fall within $\pm\sigma$ of the true value 68 percent of the time, within $\pm 2\sigma$ 95 percent
of the time, and within $\pm 3\sigma$ 99.7 percent of the time. Extending this, one would
expect a measurement to be off by $\pm 20\sigma$ only one time out of $2 \times 10^{88}$. We all
know that “glitches” are much more likely than that!

In some instances, the deviations from a normal distribution are easy to 
understand and quantify. For example, in measurements obtained by counting 
events, the measurement errors are usually distributed as a Poisson distribution, 
whose cumulative probability function was already discussed in §6.2. When the 
number of counts going into one data point is large, the Poisson distribution converges 
towards a Gaussian. However, the convergence is not uniform when measured in 
fractional accuracy. The more standard deviations out on the tail of the distribution, 
the larger the number of counts must be before a value close to the Gaussian is 
realized. The sign of the effect is always the same: The Gaussian predicts that “tail” 
events are much less likely than they actually (by Poisson) are. This causes such 
events, when they occur, to skew a least-squares fit much more than they ought. 

Other times, the deviations from a normal distribution are not so easy to 
understand in detail. Experimental points are occasionally just way off. Perhaps 
the power flickered during a point’s measurement, or someone kicked the apparatus, 
or someone wrote down a wrong number. Points like this are called outliers. 
They can easily turn a least-squares fit on otherwise adequate data into nonsense. 
Their probability of occurrence in the assumed Gaussian model is so small that the 
maximum likelihood estimator is willing to distort the whole curve to try to bring 
them, mistakenly, into line. 

The subject of robust statistics deals with cases where the normal or Gaussian 
model is a bad approximation, or cases where outliers are important. We will discuss 
robust methods briefly in §15.7. All the sections between this one and that one 
assume, one way or the other, a Gaussian model for the measurement errors in the 
data. It is quite important that you keep the limitations of that model in mind, even
as you use the very useful methods that follow from assuming it. 

Finally, note that our discussion of measurement errors has been limited to 
statistical errors, the kind that will average away if we only take enough data. 
Measurements are also susceptible to systematic errors that will not go away with 
any amount of averaging. For example, the calibration of a metal meter stick might 
depend on its temperature. If we take all our measurements at the same wrong 
temperature, then no amount of averaging or numerical processing will correct for 
this unrecognized systematic error. 


Chi-Square Fitting 

We considered the chi-square statistic once before, in §14.3. Here it arises 
in a slightly different context. 

If each data point $(x_i, y_i)$ has its own, known standard deviation $\sigma_i$, then
equation (15.1.3) is modified only by putting a subscript $i$ on the symbol $\sigma$. That
subscript also propagates docilely into (15.1.4), so that the maximum likelihood
estimate of the model parameters is obtained by minimizing the quantity

$$\chi^2 \equiv \sum_{i=1}^{N} \left( \frac{y_i - y(x_i; a_1 \ldots a_M)}{\sigma_i} \right)^2 \qquad (15.1.5)$$


called the “chi-square.” 
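
In code, equation (15.1.5) is a single weighted sum. The sketch below is illustrative only: the name chi_square is an assumption, arrays are zero-based, and ymod[i] is taken to hold the model prediction at x_i for the parameters being tested.

```c
/* Chi-square of equation (15.1.5): the sum of squared residuals, each
   divided by its own standard deviation sig[i] (all assumed nonzero). */
double chi_square(const double y[], const double ymod[],
                  const double sig[], int n)
{
    double chi2 = 0.0, r;
    int i;

    for (i = 0; i < n; i++) {
        r = (y[i] - ymod[i]) / sig[i];
        chi2 += r * r;
    }
    return chi2;
}
```

A point with a small quoted error contributes heavily to the sum: a residual of 1 with a standard deviation of 0.5 adds 4, while a residual of 2 with a standard deviation of 2 adds only 1.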

To whatever extent the measurement errors actually are normally distributed, the
quantity $\chi^2$ is correspondingly a sum of $N$ squares of normally distributed quantities,
each normalized to unit variance. Once we have adjusted the $a_1 \ldots a_M$ to minimize
the value of $\chi^2$, the terms in the sum are not all statistically independent. For models
that are linear in the $a$'s, however, it turns out that the probability distribution for
different values of $\chi^2$ at its minimum can nevertheless be derived analytically, and
is the chi-square distribution for $N - M$ degrees of freedom. We learned how to
compute this probability function using the incomplete gamma function gammq in
§6.2. In particular, equation (6.2.18) gives the probability $Q$ that the chi-square
should exceed a particular value $\chi^2$ by chance, where $\nu = N - M$ is the number of
degrees of freedom. The quantity $Q$, or its complement $P \equiv 1 - Q$, is frequently
tabulated in appendices to statistics books, but we generally find it easier to use
gammq and compute our own values: $Q$ = gammq($0.5\nu$, $0.5\chi^2$). It is quite common,
and usually not too wrong, to assume that the chi-square distribution holds even for
models that are not strictly linear in the $a$'s.

This computed probability gives a quantitative measure for the goodness-of-fit 
of the model. If Q is a very small probability for some particular data set, then the 
apparent discrepancies are unlikely to be chance fluctuations. Much more probably 
either (i) the model is wrong and can be statistically rejected, or (ii) someone has lied to
you about the size of the measurement errors $\sigma_i$: they are really larger than stated.

It is an important point that the chi-square probability Q does not directly 
measure the credibility of the assumption that the measurement errors are normally 
distributed. It assumes they are. In most, but not all, cases, however, the effect of 
nonnormal errors is to create an abundance of outlier points. These decrease the 
probability Q, so that we can add another possible, though less definitive, conclusion 
to the above list: (iii) the measurement errors may not be normally distributed. 

Possibility (iii) is fairly common, and also fairly benign. It is for this reason 
that reasonable experimenters are often rather tolerant of low probabilities Q. It is 
not uncommon to deem acceptable on equal terms any models with, say, Q > 0.001. 
This is not as sloppy as it sounds: Truly wrong models will often be rejected with 
vastly smaller values of Q, $10^{-18}$, say. However, if day-in and day-out you find
yourself accepting models with $Q \sim 10^{-3}$, you really should track down the cause.

If you happen to know the actual distribution law of your measurement errors, 
then you might wish to Monte Carlo simulate some data sets drawn from a particular 
model, cf. §7.2–§7.3. You can then subject these synthetic data sets to your actual
fitting procedure, so as to determine both the probability distribution of the $\chi^2$
statistic, and also the accuracy with which your model parameters are reproduced 
by the fit. We discuss this further in §15.6. The technique is very general, but it 
can also be very expensive. 

At the opposite extreme, it sometimes happens that the probability Q is too large, 
too near to 1, literally too good to be true! Nonnormal measurement errors cannot 
in general produce this disease, since the normal distribution is about as “compact” 
as a distribution can be. Almost always, the cause of too good a chi-square fit
is that the experimenter, in a “fit” of conservativism, has overestimated his or her 
measurement errors. Very rarely, too good a chi-square signals actual fraud, data 
that has been “fudged” to fit the model. 

A rule of thumb is that a “typical” value of $\chi^2$ for a “moderately” good fit is
$\chi^2 \approx \nu$. More precise is the statement that the $\chi^2$ statistic has a mean $\nu$ and a standard
deviation $\sqrt{2\nu}$, and, asymptotically for large $\nu$, becomes normally distributed.

In some cases the uncertainties associated with a set of measurements are not
known in advance, and considerations related to $\chi^2$ fitting are used to derive a value
for $\sigma$. If we assume that all measurements have the same standard deviation, $\sigma_i = \sigma$,
and that the model does fit well, then we can proceed by first assigning an arbitrary
constant $\sigma$ to all points, next fitting for the model parameters by minimizing $\chi^2$,
and finally recomputing

$$\sigma^2 = \sum_{i=1}^{N} [y_i - y(x_i)]^2 / (N - M) \tag{15.1.6}$$

Obviously, this approach prohibits an independent assessment of goodness-of-fit, a 
fact occasionally missed by its adherents. When, however, the measurement error 
is not known, this approach at least allows some kind of error bar to be assigned 
to the points. 

If we take the derivative of equation (15.1.5) with respect to the parameters $a_k$,
we obtain equations that must hold at the chi-square minimum,

$$0 = \sum_{i=1}^{N} \left( \frac{y_i - y(x_i)}{\sigma_i^2} \right) \left( \frac{\partial y(x_i; \ldots a_k \ldots)}{\partial a_k} \right) \qquad k = 1, \ldots, M \tag{15.1.7}$$

Equation (15.1.7) is, in general, a set of M nonlinear equations for the M unknown
$a_k$. Various of the procedures described subsequently in this chapter derive from
(15.1.7) and its specializations.


CITED REFERENCES AND FURTHER READING: 

Bevington, P.R. 1969, Data Reduction and Error Analysis for the Physical Sciences (New York:
McGraw-Hill), Chapters 1-4. 

von Mises, R. 1964, Mathematical Theory of Probability and Statistics (New York: Academic 
Press), §VI.C. [1] 



15.2 Fitting Data to a Straight Line 




A concrete example will make the considerations of the previous section more
meaningful. We consider the problem of fitting a set of N data points $(x_i, y_i)$ to
a straight-line model

$$y(x) = y(x; a, b) = a + bx \tag{15.2.1}$$

This problem is often called linear regression, a terminology that originated, long
ago, in the social sciences. We assume that the uncertainty $\sigma_i$ associated with
each measurement $y_i$ is known, and that the $x_i$’s (values of the independent variable)
are known exactly.

To measure how well the model agrees with the data, we use the chi-square
merit function (15.1.5), which in this case is

$$\chi^2(a, b) = \sum_{i=1}^{N} \left( \frac{y_i - a - b x_i}{\sigma_i} \right)^2 \tag{15.2.2}$$

If the measurement errors are normally distributed, then this merit function will give
maximum likelihood parameter estimations of a and b; if the errors are not normally
distributed, then the estimations are not maximum likelihood, but may still be useful
in a practical sense. In §15.7, we will treat the case where outlier points are so
numerous as to render the $\chi^2$ merit function useless.

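Equation (15.2.2) is minimized to determine a and b. At its minimum,
derivatives of $\chi^2(a, b)$ with respect to a, b vanish.

$$
\begin{aligned}
0 &= \frac{\partial \chi^2}{\partial a} = -2 \sum_{i=1}^{N} \frac{y_i - a - b x_i}{\sigma_i^2} \\
0 &= \frac{\partial \chi^2}{\partial b} = -2 \sum_{i=1}^{N} \frac{x_i (y_i - a - b x_i)}{\sigma_i^2}
\end{aligned}
\tag{15.2.3}
$$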

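These conditions can be rewritten in a convenient form if we define the following
sums:

$$
\begin{aligned}
S &\equiv \sum_{i=1}^{N} \frac{1}{\sigma_i^2} \qquad
S_x \equiv \sum_{i=1}^{N} \frac{x_i}{\sigma_i^2} \qquad
S_y \equiv \sum_{i=1}^{N} \frac{y_i}{\sigma_i^2} \\
S_{xx} &\equiv \sum_{i=1}^{N} \frac{x_i^2}{\sigma_i^2} \qquad
S_{xy} \equiv \sum_{i=1}^{N} \frac{x_i y_i}{\sigma_i^2}
\end{aligned}
\tag{15.2.4}
$$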


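With these definitions (15.2.3) becomes

$$
\begin{aligned}
aS + bS_x &= S_y \\
aS_x + bS_{xx} &= S_{xy}
\end{aligned}
\tag{15.2.5}
$$

The solution of these two equations in two unknowns is calculated as

$$
\begin{aligned}
\Delta &\equiv S S_{xx} - (S_x)^2 \\
a &= \frac{S_{xx} S_y - S_x S_{xy}}{\Delta} \\
b &= \frac{S S_{xy} - S_x S_y}{\Delta}
\end{aligned}
\tag{15.2.6}
$$
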

Equation (15.2.6) gives the solution for the best-fit model parameters a and b. 
We are not done, however. We must estimate the probable uncertainties in
the estimates of a and b, since obviously the measurement errors in the data must
introduce some uncertainty in the determination of those parameters. If the data
are independent, then each contributes its own bit of uncertainty to the parameters.
Consideration of propagation of errors shows that the variance $\sigma_f^2$ in the value of
any function will be

$$\sigma_f^2 = \sum_{i=1}^{N} \sigma_i^2 \left( \frac{\partial f}{\partial y_i} \right)^2 \tag{15.2.7}$$


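For the straight line, the derivatives of a and b with respect to $y_i$ can be directly
evaluated from the solution:

$$
\begin{aligned}
\frac{\partial a}{\partial y_i} &= \frac{S_{xx} - S_x x_i}{\sigma_i^2 \Delta} \\
\frac{\partial b}{\partial y_i} &= \frac{S x_i - S_x}{\sigma_i^2 \Delta}
\end{aligned}
\tag{15.2.8}
$$

Summing over the points as in (15.2.7), we get

$$
\begin{aligned}
\sigma_a^2 &= S_{xx} / \Delta \\
\sigma_b^2 &= S / \Delta
\end{aligned}
\tag{15.2.9}
$$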


which are the variances in the estimates of a and b, respectively. We will see in
§15.6 that an additional number is also needed to characterize properly the probable
uncertainty of the parameter estimation. That number is the covariance of a and b,
and (as we will see below) is given by

$$\mathrm{Cov}(a, b) = -S_x / \Delta \tag{15.2.10}$$

The coefficient of correlation between the uncertainty in a and the uncertainty
in b, which is a number between −1 and 1, follows from (15.2.10) (compare
equation 14.5.1),

$$r_{ab} = \frac{-S_x}{\sqrt{S S_{xx}}} \tag{15.2.11}$$

A positive value of $r_{ab}$ indicates that the errors in a and b are likely to have the
same sign, while a negative value indicates the errors are anticorrelated, likely to
have opposite signs.

We are still not done. We must estimate the goodness-of-fit of the data to the 
model. Absent this estimate, we have not the slightest indication that the parameters 
a and b in the model have any meaning at all! The probability Q that a value of 
chi-square as poor as the value (15.2.2) should occur by chance is 


$$Q = \mathrm{gammq}\left( \frac{N - 2}{2}, \frac{\chi^2}{2} \right) \tag{15.2.12}$$

Here gammq is our routine for the incomplete gamma function Q(a,x), §6.2. If
Q is larger than, say, 0.1, then the goodness-of-fit is believable. If it is larger 
than, say, 0.001, then the fit may be acceptable if the errors are nonnormal or have 
been moderately underestimated. If Q is less than 0.001 then the model and/or 
estimation procedure can rightly be called into question. In this latter case, turn 
to §15.7 to proceed further. 

If you do not know the individual measurement errors of the points $\sigma_i$, and are
proceeding (dangerously) to use equation (15.1.6) for estimating these errors, then
here is the procedure for estimating the probable uncertainties of the parameters a
and b: Set $\sigma_i \equiv 1$ in all equations through (15.2.6), and multiply $\sigma_a$ and $\sigma_b$, as
obtained from equation (15.2.9), by the additional factor $\sqrt{\chi^2/(N-2)}$, where $\chi^2$
is computed by (15.2.2) using the fitted parameters a and b. As discussed above,
this procedure is equivalent to assuming a good fit, so you get no independent
goodness-of-fit probability Q.

In §14.5 we promised a relation between the linear correlation coefficient
r (equation 14.5.1) and a goodness-of-fit measure, $\chi^2$ (equation 15.2.2). For
unweighted data (all $\sigma_i = 1$), that relation is

$$\chi^2 = (1 - r^2)\,\mathrm{NVar}(y_1 \ldots y_N) \tag{15.2.13}$$

where

$$\mathrm{NVar}(y_1 \ldots y_N) \equiv \sum_{i=1}^{N} (y_i - \bar{y})^2 \tag{15.2.14}$$

For data with varying weights $\sigma_i$, the above equations remain valid if the sums in
equation (14.5.1) are weighted by $1/\sigma_i^2$.


The following function, fit, carries out exactly the operations that we have
discussed. When the weights $\sigma$ are known in advance, the calculations exactly
correspond to the formulas above. However, when weights $\sigma$ are unavailable,
the routine assumes equal values of $\sigma$ for each point and assumes a good fit, as
discussed in §15.1.

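The formulas (15.2.6) are susceptible to roundoff error. Accordingly, we
rewrite them as follows: Define

$$t_i = \frac{1}{\sigma_i} \left( x_i - \frac{S_x}{S} \right), \qquad i = 1, 2, \ldots, N \tag{15.2.15}$$

and

$$S_{tt} = \sum_{i=1}^{N} t_i^2 \tag{15.2.16}$$

Then, as you can verify by direct substitution,

$$b = \frac{1}{S_{tt}} \sum_{i=1}^{N} \frac{t_i y_i}{\sigma_i} \tag{15.2.17}$$

$$a = \frac{S_y - S_x b}{S} \tag{15.2.18}$$

$$\sigma_a^2 = \frac{1}{S} \left( 1 + \frac{S_x^2}{S S_{tt}} \right) \tag{15.2.19}$$

$$\sigma_b^2 = \frac{1}{S_{tt}} \tag{15.2.20}$$

$$\mathrm{Cov}(a, b) = -\frac{S_x}{S S_{tt}} \tag{15.2.21}$$

$$r_{ab} = \frac{\mathrm{Cov}(a, b)}{\sigma_a \sigma_b} \tag{15.2.22}$$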


#include <math.h>
#include "nrutil.h"

void fit(float x[], float y[], int ndata, float sig[], int mwt, float *a,
    float *b, float *siga, float *sigb, float *chi2, float *q)
/* Given a set of data points x[1..ndata], y[1..ndata] with individual standard deviations
sig[1..ndata], fit them to a straight line y = a + bx by minimizing chi-square. Returned are
a,b and their respective probable uncertainties siga and sigb, the chi-square chi2, and the
goodness-of-fit probability q (that the fit would have chi-square this large or larger). If mwt=0 on
input, then the standard deviations are assumed to be unavailable: q is returned as 1.0 and
the normalization of chi2 is to unit standard deviation on all points. */
{
    float gammq(float a, float x);
    int i;
    float wt,t,sxoss,sx=0.0,sy=0.0,st2=0.0,ss,sigdat;

    *b=0.0;
    if (mwt) {                              /* Accumulate sums ... */
        ss=0.0;
        for (i=1;i<=ndata;i++) {            /* ...with weights */
            wt=1.0/SQR(sig[i]);
            ss += wt;
            sx += x[i]*wt;
            sy += y[i]*wt;
        }
    } else {
        for (i=1;i<=ndata;i++) {            /* ...or without weights. */
            sx += x[i];
            sy += y[i];
        }
        ss=ndata;
    }
    sxoss=sx/ss;
    if (mwt) {
        for (i=1;i<=ndata;i++) {
            t=(x[i]-sxoss)/sig[i];
            st2 += t*t;
            *b += t*y[i]/sig[i];
        }
    } else {
        for (i=1;i<=ndata;i++) {
            t=x[i]-sxoss;
            st2 += t*t;
            *b += t*y[i];
        }
    }
    *b /= st2;                              /* Solve for a, b, sigma_a, and sigma_b. */
    *a=(sy-sx*(*b))/ss;
    *siga=sqrt((1.0+sx*sx/(ss*st2))/ss);
    *sigb=sqrt(1.0/st2);
    *chi2=0.0;                              /* Calculate chi-square. */
    *q=1.0;
    if (mwt == 0) {
        for (i=1;i<=ndata;i++)
            *chi2 += SQR(y[i]-(*a)-(*b)*x[i]);
        sigdat=sqrt((*chi2)/(ndata-2));     /* For unweighted data evaluate typical */
        *siga *= sigdat;                    /* sig using chi2, and adjust the */
        *sigb *= sigdat;                    /* standard deviations. */
    } else {
        for (i=1;i<=ndata;i++)
            *chi2 += SQR((y[i]-(*a)-(*b)*x[i])/sig[i]);
        if (ndata>2) *q=gammq(0.5*(ndata-2),0.5*(*chi2));   /* Equation (15.2.12). */
    }
}


CITED REFERENCES AND FURTHER READING: 

Bevington, P.R. 1969, Data Reduction and Error Analysis for the Physical Sciences (New York:
McGraw-Hill), Chapter 6. 


15.3 Straight-Line Data with Errors in Both 
Coordinates 


If experimental data are subject to measurement error not only in the $y_i$’s, but also in
the $x_i$’s, then the task of fitting a straight-line model

$$y(x) = a + bx \tag{15.3.1}$$

is considerably harder. It is straightforward to write down the $\chi^2$ merit function for this case,

$$\chi^2(a, b) = \sum_{i=1}^{N} \frac{(y_i - a - b x_i)^2}{\sigma_{y\,i}^2 + b^2 \sigma_{x\,i}^2} \tag{15.3.2}$$

where $\sigma_{x\,i}$ and $\sigma_{y\,i}$ are, respectively, the x and y standard deviations for the ith point. The
weighted sum of variances in the denominator of equation (15.3.2) can be understood both
as the variance in the direction of the smallest $\chi^2$ between each data point and the line with
slope b, and also as the variance of the linear combination $y_i - a - b x_i$ of two random
variables $x_i$ and $y_i$,

$$\mathrm{Var}(y_i - a - b x_i) = \mathrm{Var}(y_i) + b^2\,\mathrm{Var}(x_i) = \sigma_{y\,i}^2 + b^2 \sigma_{x\,i}^2 \equiv 1/w_i \tag{15.3.3}$$

The sum of the square of N random variables, each normalized by its variance, is thus
$\chi^2$-distributed.

We want to minimize equation (15.3.2) with respect to a and b. Unfortunately, the
occurrence of b in the denominator of equation (15.3.2) makes the resulting equation for
the slope $\partial\chi^2/\partial b = 0$ nonlinear. However, the corresponding condition for the intercept,
$\partial\chi^2/\partial a = 0$, is still linear and yields

$$a = \left[ \sum_{i} w_i (y_i - b x_i) \right] \Big/ \sum_{i} w_i \tag{15.3.4}$$

where the $w_i$’s are defined by equation (15.3.3). A reasonable strategy, now, is to use the
machinery of Chapter 10 (e.g., the routine brent) for minimizing a general one-dimensional
function to minimize with respect to b, while using equation (15.3.4) at each stage to ensure
that the minimum with respect to b is also minimized with respect to a.
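
To make the strategy concrete, the following minimal sketch evaluates $\chi^2$ of equation (15.3.2) at a trial slope b, with the intercept fixed by equation (15.3.4). The function name chi2_xy, the zero-based arrays, and the use of the slope itself as the argument (rather than an angle) are assumptions of this illustration. It further assumes that no point has both error bars equal to zero.

```c
#include <stddef.h>

/* Chi-square of (15.3.2) at trial slope b, minimized over the intercept a
   by means of (15.3.4). Arrays are zero-based in this sketch. If a_best is
   not NULL, the optimal intercept for this slope is stored there. */
double chi2_xy(const double x[], const double y[],
               const double sigx[], const double sigy[],
               int n, double b, double *a_best)
{
    double sumw = 0.0, sumwr = 0.0, a, chi2 = 0.0;
    int i;

    for (i = 0; i < n; i++) {
        /* Weight w_i from (15.3.3). */
        double w = 1.0 / (sigy[i] * sigy[i] + b * b * sigx[i] * sigx[i]);
        sumw  += w;
        sumwr += w * (y[i] - b * x[i]);
    }
    a = sumwr / sumw;                       /* Equation (15.3.4). */

    for (i = 0; i < n; i++) {
        double w = 1.0 / (sigy[i] * sigy[i] + b * b * sigx[i] * sigx[i]);
        double r = y[i] - a - b * x[i];
        chi2 += w * r * r;
    }
    if (a_best != NULL) *a_best = a;
    return chi2;
}
```

A one-dimensional minimizer can then be applied to this function of b alone, since the intercept is eliminated at every evaluation.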
Figure 15.3.1. Standard errors for the parameters a and b. The point B can be found by varying the
slope b while simultaneously minimizing the intercept a. This gives the standard error $\sigma_b$, and also the
value s. The standard error $\sigma_a$ can then be found by the geometric relation $\sigma_a^2 = s^2 + r^2$.




Because of the finite error bars on the $x_i$’s, the minimum $\chi^2$ as a function of b will
be finite, though usually large, when b equals infinity (line of infinite slope). The angle
$\theta = \arctan b$ is thus more suitable as a parametrization of slope than b itself. The value of $\chi^2$
will then be periodic in $\theta$ with period $\pi$ (not $2\pi$!). If any data points have very small $\sigma_y$’s
but moderate or large $\sigma_x$’s, then it is also possible to have a maximum in $\chi^2$ near zero slope,
$\theta \approx 0$. In that case, there can sometimes be two $\chi^2$ minima, one at positive slope and the
other at negative. Only one of these is the correct global minimum. It is therefore important
to have a good starting guess for b (or $\theta$). Our strategy, implemented below, is to scale the
$y_i$’s so as to have variance equal to the $x_i$’s, then to do a conventional (as in §15.2) linear fit
with weights derived from the (scaled) sum $\sigma_{y\,i}^2 + \sigma_{x\,i}^2$. This yields a good starting guess for
b if the data are even plausibly related to a straight-line model.

Finding the standard errors $\sigma_a$ and $\sigma_b$ on the parameters a and b is more complicated.
We will see in §15.6 that, in appropriate circumstances, the standard errors in a and b are the
respective projections onto the a and b axes of the “confidence region boundary” where $\chi^2$
takes on a value one greater than its minimum, $\Delta\chi^2 = 1$. In the linear case of §15.2, these
projections follow from the Taylor series expansion

$$\Delta\chi^2 \approx \frac{1}{2} \left[ \frac{\partial^2 \chi^2}{\partial a^2} (\Delta a)^2 + \frac{\partial^2 \chi^2}{\partial b^2} (\Delta b)^2 \right] + \frac{\partial^2 \chi^2}{\partial a\,\partial b}\,\Delta a\,\Delta b \tag{15.3.5}$$

Because of the present nonlinearity in b, however, analytic formulas for the second derivatives
are quite unwieldy; more important, the lowest-order term frequently gives a poor approximation
to $\Delta\chi^2$. Our strategy is therefore to find the roots of $\Delta\chi^2 = 1$ numerically, by adjusting
the value of the slope b away from the minimum. In the program below the general root finder
zbrent is used. It may occur that there are no roots at all, for example if all error bars are
so large that all the data points are compatible with each other. It is important, therefore, to
make some effort at bracketing a putative root before refining it (cf. §9.1).

Because a is minimized at each stage of varying b, successful numerical root-finding
leads to a value of $\Delta a$ that minimizes $\chi^2$ for the value of $\Delta b$ that gives $\Delta\chi^2 = 1$. This (see
Figure 15.3.1) directly gives the tangent projection of the confidence region onto the b axis,
and thus $\sigma_b$. It does not, however, give the tangent projection of the confidence region onto
the a axis. In the figure, we have found the point labeled B; to find $\sigma_a$ we need to find the
point A. Geometry to the rescue: To the extent that the confidence region is approximated
by an ellipse, then you can prove (see figure) that $\sigma_a^2 = r^2 + s^2$. The value of s is known
from having found the point B. The value of r follows from equations (15.3.2) and (15.3.3)
applied at the $\chi^2$ minimum (point O in the figure), giving

$$r^2 = 1 \Big/ \sum_{i} w_i \tag{15.3.6}$$

Actually, since b can go through infinity, this whole procedure makes more sense in
$(a, \theta)$ space than in $(a, b)$ space. That is in fact how the following program works. Since
it is conventional, however, to return standard errors for a and b, not a and $\theta$, we finally
use the relation

$$\sigma_b = \sigma_\theta / \cos^2 \theta \tag{15.3.7}$$

We caution that if b and its standard error are both large, so that the confidence region actually
includes infinite slope, then the standard error $\sigma_b$ is not very meaningful. The function chixy
is normally called only by the routine fitexy. However, if you want, you can yourself explore
the confidence region by making repeated calls to chixy (whose argument is an angle $\theta$, not
a slope b), after a single initializing call to fitexy.

A final caution, repeated from §15.0, is that if the goodness-of-fit is not acceptable
(returned probability is too small), the standard errors $\sigma_a$ and $\sigma_b$ are surely not believable. In
dire circumstances, you might try scaling all your x and y error bars by a constant factor until
the probability is acceptable (0.5, say), to get more plausible values for $\sigma_a$ and $\sigma_b$.


#include <math.h>
#include "nrutil.h"
#define POTN 1.571000
#define BIG 1.0e30
#define PI 3.14159265
#define ACC 1.0e-3

int nn;                                     /* Global variables communicate with */
float *xx,*yy,*sx,*sy,*ww,aa,offs;          /* chixy. */

void fitexy(float x[], float y[], int ndat, float sigx[], float sigy[],
    float *a, float *b, float *siga, float *sigb, float *chi2, float *q)
/* Straight-line fit to input data x[1..ndat] and y[1..ndat] with errors in both x and y, the
respective standard deviations being the input quantities sigx[1..ndat] and sigy[1..ndat].
Output quantities are a and b such that y = a + bx minimizes chi-square, whose value is returned
as chi2. The chi-square probability is returned as q, a small value indicating a poor fit (sometimes
indicating underestimated errors). Standard errors on a and b are returned as siga and sigb.
These are not meaningful if either (i) the fit is poor, or (ii) b is so large that the data are
consistent with a vertical (infinite b) line. If siga and sigb are returned as BIG, then the data
are consistent with all values of b. */
{
    void avevar(float data[], unsigned long n, float *ave, float *var);
    float brent(float ax, float bx, float cx,
        float (*f)(float), float tol, float *xmin);
    float chixy(float bang);
    void fit(float x[], float y[], int ndata, float sig[], int mwt,
        float *a, float *b, float *siga, float *sigb, float *chi2, float *q);
    float gammq(float a, float x);
    void mnbrak(float *ax, float *bx, float *cx, float *fa, float *fb,
        float *fc, float (*func)(float));
    float zbrent(float (*func)(float), float x1, float x2, float tol);
    int j;
    float swap,amx,amn,varx,vary,ang[7],ch[7],scale,bmn,bmx,d1,d2,r2,
        dum1,dum2,dum3,dum4,dum5;

    xx=vector(1,ndat);
    yy=vector(1,ndat);
    sx=vector(1,ndat);
    sy=vector(1,ndat);
    ww=vector(1,ndat);
    avevar(x,ndat,&dum1,&varx);             /* Find the x and y variances, and scale */
    avevar(y,ndat,&dum1,&vary);             /* the data into the global variables */
    scale=sqrt(varx/vary);                  /* for communication with the function */
    nn=ndat;                                /* chixy. */
    for (j=1;j<=ndat;j++) {
        xx[j]=x[j];
        yy[j]=y[j]*scale;
        sx[j]=sigx[j];
        sy[j]=sigy[j]*scale;
        ww[j]=sqrt(SQR(sx[j])+SQR(sy[j]));  /* Use both x and y weights in first */
    }                                       /* trial fit. */
    fit(xx,yy,nn,ww,1,&dum1,b,&dum2,&dum3,&dum4,&dum5);    /* Trial fit for b. */
    offs=ang[1]=0.0;                        /* Construct several angles for */
    ang[2]=atan(*b);                        /* reference points, and make b an */
    ang[4]=0.0;                             /* angle. */
    ang[5]=ang[2];
    ang[6]=POTN;
    for (j=4;j<=6;j++) ch[j]=chixy(ang[j]);
    mnbrak(&ang[1],&ang[2],&ang[3],&ch[1],&ch[2],&ch[3],chixy);
    /* Bracket the chi-square minimum and then locate it with brent. */
    *chi2=brent(ang[1],ang[2],ang[3],chixy,ACC,b);
    *chi2=chixy(*b);
    *a=aa;
    *q=gammq(0.5*(nn-2),*chi2*0.5);         /* Compute chi-square probability. */
    for (r2=0.0,j=1;j<=nn;j++) r2 += ww[j]; /* Save the inverse sum of weights at */
    r2=1.0/r2;                              /* the minimum. */
    bmx=BIG;                                /* Now, find standard errors for b as */
    bmn=BIG;                                /* points where Delta chi-square = 1. */
    offs=(*chi2)+1.0;
    for (j=1;j<=6;j++) {                    /* Go through saved values to bracket */
        if (ch[j] > offs) {                 /* the desired roots. Note periodicity */
            d1=fabs(ang[j]-(*b));           /* in slope angles. */
            while (d1 >= PI) d1 -= PI;
            d2=PI-d1;
            if (ang[j] < *b) {
                swap=d1;
                d1=d2;
                d2=swap;
            }
            if (d1 < bmx) bmx=d1;
            if (d2 < bmn) bmn=d2;
        }
    }
    if (bmx < BIG) {                        /* Call zbrent to find the roots. */
        bmx=zbrent(chixy,*b,*b+bmx,ACC)-(*b);
        amx=aa-(*a);
        bmn=zbrent(chixy,*b,*b-bmn,ACC)-(*b);
        amn=aa-(*a);
        *sigb=sqrt(0.5*(bmx*bmx+bmn*bmn))/(scale*SQR(cos(*b)));
        *siga=sqrt(0.5*(amx*amx+amn*amn)+r2)/scale;    /* Error in a has additional piece r2. */
    } else (*sigb)=(*siga)=BIG;
    *a /= scale;                            /* Unscale the answers. */
    *b=tan(*b)/scale;
    free_vector(ww,1,ndat);
    free_vector(sy,1,ndat);
    free_vector(sx,1,ndat);
    free_vector(yy,1,ndat);
    free_vector(xx,1,ndat);
}

#include <math.h>
#include "nrutil.h"
#define BIG 1.0e30

extern int nn;
extern float *xx,*yy,*sx,*sy,*ww,aa,offs;

float chixy(float bang)
/* Captive function of fitexy, returns the value of (chi-square - offs) for the slope b=tan(bang).
Scaled data and offs are communicated via the global variables. */
{
    int j;
    float ans,avex=0.0,avey=0.0,sumw=0.0,b;

    b=tan(bang);
    for (j=1;j<=nn;j++) {
        ww[j] = SQR(b*sx[j])+SQR(sy[j]);
        sumw += (ww[j] = (ww[j] < 1.0/BIG ? BIG : 1.0/ww[j]));
        avex += ww[j]*xx[j];
        avey += ww[j]*yy[j];
    }
    avex /= sumw;
    avey /= sumw;
    aa=avey-b*avex;
    for (ans = -offs,j=1;j<=nn;j++)
        ans += ww[j]*SQR(yy[j]-aa-b*xx[j]);
    return ans;
}


Be aware that the literature on the seemingly straightforward subject of this section 
is generally confusing and sometimes plain wrong. Deming’s [1] early treatment is sound,
but its reliance on Taylor expansions gives inaccurate error estimates. References [2-4] are
reliable, more recent, general treatments with critiques of earlier work. York [5] and Reed [6]
usefully discuss the simple case of a straight line as treated here, but the latter paper has
some errors, corrected in [7]. All this commotion has attracted the Bayesians [8-10], who
have still different points of view.


CITED REFERENCES AND FURTHER READING:

York, D. 1966, Canadian Journal of Physics, vol. 44, pp. 1079-1086. [5]


15.4 General Linear Least Squares

An immediate generalization of §15.2 is to fit a set of data points $(x_i, y_i)$ to a
model that is not just a linear combination of 1 and x (namely a + bx), but rather a
linear combination of any M specified functions of x. For example, the functions
could be $1, x, x^2, \ldots, x^{M-1}$, in which case their general linear combination,

$$y(x) = a_1 + a_2 x + a_3 x^2 + \cdots + a_M x^{M-1} \tag{15.4.1}$$

is a polynomial of degree M − 1. Or, the functions could be sines and cosines, in
which case their general linear combination is a harmonic series.

The general form of this kind of model is

$$y(x) = \sum_{k=1}^{M} a_k X_k(x) \tag{15.4.2}$$

where $X_1(x), \ldots, X_M(x)$ are arbitrary fixed functions of x, called the basis
functions.

Note that the functions $X_k(x)$ can be wildly nonlinear functions of x. In this
discussion “linear” refers only to the model’s dependence on its parameters $a_k$.

For these linear models we generalize the discussion of the previous section
by defining a merit function

$$\chi^2 = \sum_{i=1}^{N} \left[ \frac{y_i - \sum_{k=1}^{M} a_k X_k(x_i)}{\sigma_i} \right]^2 \tag{15.4.3}$$

As before, $\sigma_i$ is the measurement error (standard deviation) of the ith data point,
presumed to be known. If the measurement errors are not known, they may all (as
discussed at the end of §15.1) be set to the constant value $\sigma = 1$.

Once again, we will pick as best parameters those that minimize $\chi^2$. There are
several different techniques available for finding this minimum. Two are particularly
useful, and we will discuss both in this section. To introduce them and elucidate
their relationship, we need some notation.

Let A be a matrix whose N × M components are constructed from the M
basis functions evaluated at the N abscissas $x_i$, and from the N measurement errors
$\sigma_i$, by the prescription

$$A_{ij} = \frac{X_j(x_i)}{\sigma_i} \tag{15.4.4}$$

The matrix A is called the design matrix of the fitting problem. Notice that in general
A has more rows than columns, $N \geq M$, since there must be more data points than
model parameters to be solved for. (You can fit a straight line to two points, but not a
very meaningful quintic!) The design matrix is shown schematically in Figure 15.4.1.
Also define a vector b of length N by

$$b_i = \frac{y_i}{\sigma_i} \tag{15.4.5}$$

and denote the M vector whose components are the parameters to be fitted,
$a_1, \ldots, a_M$, by a.

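
As an illustration of definitions (15.4.4) and (15.4.5), the sketch below fills the design matrix and the vector b for the polynomial basis of equation (15.4.1). The function name design_matrix_poly, the zero-based indexing, and the flat row-major storage A[i*m + j] are assumptions made for this example and differ from the unit-offset conventions of the book's routines.

```c
/* Fill the n x m design matrix A (row-major, zero-based: A[i*m + j]) and the
   vector b for the polynomial basis X_j(x) = x^j, j = 0..m-1, following
   equations (15.4.4) and (15.4.5). All sig[i] are assumed nonzero. */
void design_matrix_poly(const double x[], const double y[], const double sig[],
                        int n, int m, double A[], double b[])
{
    int i, j;

    for (i = 0; i < n; i++) {
        double p = 1.0 / sig[i];    /* X_0(x_i) / sigma_i */
        for (j = 0; j < m; j++) {
            A[i * m + j] = p;
            p *= x[i];              /* next power of x_i, still divided by sigma_i */
        }
        b[i] = y[i] / sig[i];
    }
}
```

Note that the measured values y enter only the vector b, never the matrix A.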
Figure 15.4.1. Design matrix for the least-squares fit of a linear combination of M basis functions to N
data points. The matrix elements involve the basis functions evaluated at the values of the independent 
variable at which measurements are made, and the standard deviations of the measured dependent variable. 
The measured values of the dependent variable do not enter the design matrix. 

Solution by Use of the Normal Equations 

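The minimum of (15.4.3) occurs where the derivative of $\chi^2$ with respect to all
M parameters $a_k$ vanishes. Specializing equation (15.1.7) to the case of the model
(15.4.2), this condition yields the M equations

$$0 = \sum_{i=1}^{N} \frac{1}{\sigma_i^2} \left[ y_i - \sum_{j=1}^{M} a_j X_j(x_i) \right] X_k(x_i) \qquad k = 1, \ldots, M \tag{15.4.6}$$

Interchanging the order of summations, we can write (15.4.6) as the matrix equation

$$\sum_{j=1}^{M} \alpha_{kj} a_j = \beta_k \tag{15.4.7}$$

where

$$\alpha_{kj} = \sum_{i=1}^{N} \frac{X_j(x_i) X_k(x_i)}{\sigma_i^2} \quad \text{or equivalently} \quad [\alpha] = \mathbf{A}^T \cdot \mathbf{A} \tag{15.4.8}$$

an M × M matrix, and

$$\beta_k = \sum_{i=1}^{N} \frac{y_i X_k(x_i)}{\sigma_i^2} \quad \text{or equivalently} \quad [\beta] = \mathbf{A}^T \cdot \mathbf{b} \tag{15.4.9}$$
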
a vector of length M.

The equations (15.4.6) or (15.4.7) are called the normal equations of the least-squares
problem. They can be solved for the vector of parameters a by the standard
methods of Chapter 2, notably LU decomposition and backsubstitution, Cholesky
decomposition, or Gauss-Jordan elimination. In matrix form, the normal equations
can be written as either

$$[\alpha] \cdot \mathbf{a} = [\beta] \quad \text{or as} \quad (\mathbf{A}^T \cdot \mathbf{A}) \cdot \mathbf{a} = \mathbf{A}^T \cdot \mathbf{b} \tag{15.4.10}$$
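
The matrix forms in (15.4.8) and (15.4.9) can be accumulated directly from a design matrix. The following sketch assumes a flat, zero-based, row-major layout (A[i*m + j]) and a hypothetical function name normal_equations; it only forms $[\alpha]$ and $[\beta]$ and leaves the solution to whichever linear solver is preferred.

```c
/* Form alpha = A^T A (m x m, row-major) and beta = A^T b from an n x m
   design matrix A (row-major, zero-based) and a vector b of length n,
   following equations (15.4.8) and (15.4.9). */
void normal_equations(const double A[], const double b[], int n, int m,
                      double alpha[], double beta[])
{
    int i, j, k;

    for (k = 0; k < m; k++) {
        beta[k] = 0.0;
        for (j = 0; j < m; j++) alpha[k * m + j] = 0.0;
    }
    for (i = 0; i < n; i++) {
        for (k = 0; k < m; k++) {
            beta[k] += A[i * m + k] * b[i];
            for (j = 0; j <= k; j++)        /* lower triangle only */
                alpha[k * m + j] += A[i * m + k] * A[i * m + j];
        }
    }
    for (k = 0; k < m; k++)                 /* fill upper triangle by symmetry */
        for (j = 0; j < k; j++)
            alpha[j * m + k] = alpha[k * m + j];
}
```

For the basis {1, x} with unit errors, the entries of $[\alpha]$ are exactly the sums S, $S_x$, and $S_{xx}$ of §15.2.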

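The inverse matrix $C_{jk} \equiv [\alpha]^{-1}_{jk}$ is closely related to the probable (or, more
precisely, standard) uncertainties of the estimated parameters a. To estimate these
uncertainties, consider that

$$a_j = \sum_{k=1}^{M} [\alpha]^{-1}_{jk} \beta_k = \sum_{k=1}^{M} C_{jk} \left[ \sum_{i=1}^{N} \frac{y_i X_k(x_i)}{\sigma_i^2} \right] \tag{15.4.11}$$

and that the variance associated with the estimate $a_j$ can be found as in (15.2.7) from

$$\sigma^2(a_j) = \sum_{i=1}^{N} \sigma_i^2 \left( \frac{\partial a_j}{\partial y_i} \right)^2 \tag{15.4.12}$$

Note that $\alpha_{jk}$ is independent of $y_i$, so that

$$\frac{\partial a_j}{\partial y_i} = \sum_{k=1}^{M} C_{jk} X_k(x_i) / \sigma_i^2 \tag{15.4.13}$$

Consequently, we find that

$$\sigma^2(a_j) = \sum_{k=1}^{M} \sum_{l=1}^{M} C_{jk} C_{jl} \left[ \sum_{i=1}^{N} \frac{X_k(x_i) X_l(x_i)}{\sigma_i^2} \right] \tag{15.4.14}$$

The final term in brackets is just the matrix $[\alpha]$. Since this is the matrix inverse of
$[C]$, (15.4.14) reduces immediately to

$$\sigma^2(a_j) = C_{jj} \tag{15.4.15}$$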


In other words, the diagonal elements of $[C]$ are the variances (squared
uncertainties) of the fitted parameters a. It should not surprise you to learn that the
off-diagonal elements $C_{jk}$ are the covariances between $a_j$ and $a_k$ (cf. 15.2.10); but
we shall defer discussion of these to §15.6.
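
For M = 2 the inverse can be written out by hand, which makes the connection with §15.2 easy to check. The sketch below is a toy illustration restricted to the 2 × 2 case; the function name solve_normal_2x2 and the flat row-major arrays are assumptions of this example, and a general problem would use a proper solver instead.

```c
/* Solve the 2 x 2 normal equations [alpha] a = [beta] by explicit inversion.
   On success, C holds the covariance matrix [alpha]^(-1) (row-major) and
   a holds the fitted parameters. Returns -1 if [alpha] is singular. */
int solve_normal_2x2(const double alpha[4], const double beta[2],
                     double a[2], double C[4])
{
    double det = alpha[0] * alpha[3] - alpha[1] * alpha[2];

    if (det == 0.0) return -1;
    C[0] =  alpha[3] / det;     /* variance of a[0], equation (15.4.15) */
    C[1] = -alpha[1] / det;     /* covariance of a[0] and a[1] */
    C[2] = -alpha[2] / det;
    C[3] =  alpha[0] / det;     /* variance of a[1] */
    a[0] = C[0] * beta[0] + C[1] * beta[1];
    a[1] = C[2] * beta[0] + C[3] * beta[1];
    return 0;
}
```

With the basis {1, x}, the determinant is the quantity $\Delta$ of (15.2.6), and C reproduces $\sigma_a^2 = S_{xx}/\Delta$, $\sigma_b^2 = S/\Delta$, and Cov(a, b) = $-S_x/\Delta$.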

We will now give a routine that implements the above formulas for the general 
linear least-squares problem, by the method of normal equations. Since we wish to 
compute not only the solution vector a but also the covariance matrix $[C]$, it is most
convenient to use Gauss-Jordan elimination (routine gaussj of §2.1) to perform the 
linear algebra. The operation count, in this application, is no larger than that for LU 
decomposition. If you have no need for the covariance matrix, however, you can 
save a factor of 3 on the linear algebra by switching to LU decomposition, without 
computation of the matrix inverse. In theory, since $\mathbf{A}^T \cdot \mathbf{A}$ is positive definite,
Cholesky decomposition is the most efficient way to solve the normal equations. 
However, in practice most of the computing time is spent in looping over the data 
to form the equations, and Gauss-Jordan is quite adequate. 

We need to warn you that the solution of a least-squares problem directly from 
the normal equations is rather susceptible to roundoff error. An alternative, and 
preferred, technique involves QR decomposition (§2.10, §11.3, and §11.6) of the 
design matrix A. This is essentially what we did at the end of §15.2 for fitting data to
a straight line, but without invoking all the machinery of QR to derive the necessary
formulas. Later in this section, we will discuss other difficulties in the least-squares
problem, for which the cure is singular value decomposition (SVD), of which we give
an implementation. It turns out that SVD also fixes the roundoff problem, so it is our
recommended technique for all but “easy” least-squares problems. It is for these easy 
problems that the following routine, which solves the normal equations, is intended. 

The routine below introduces one bookkeeping trick that is quite useful in
practical work. Frequently it is a matter of “art” to decide which parameters $a_k$
in a model should be fit from the data set, and which should be held constant at
fixed values, for example values predicted by a theory or measured in a previous
experiment. One wants, therefore, to have a convenient means for “freezing”
and “unfreezing” the parameters $a_k$. In the following routine the total number of
parameters $a_k$ is denoted ma (called M above). As input to the routine, you supply
an array ia[1..ma], whose components are either zero or nonzero (e.g., 1). Zeros
indicate that you want the corresponding elements of the parameter vector a[1..ma]
to be held fixed at their input values. Nonzeros indicate parameters that should be
fitted for. On output, any frozen parameters will have their variances, and all their
covariances, set to zero in the covariance matrix.
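
To show what a caller might supply, here is a small sketch of a basis-function routine with the calling convention described for funcs, together with a helper that counts the unfrozen parameters. The names poly_basis and count_fitted are hypothetical and exist only for this illustration; both follow the unit-offset convention, so element 0 of each array is unused.

```c
/* Hypothetical basis routine matching the signature
   void (*funcs)(float, float [], int): fills afunc[1..ma] with
   the powers 1, x, x^2, ..., x^(ma-1). */
void poly_basis(float x, float afunc[], int ma)
{
    int j;

    afunc[1] = 1.0f;
    for (j = 2; j <= ma; j++) afunc[j] = afunc[j - 1] * x;
}

/* Count the parameters flagged for fitting in ia[1..ma]
   (nonzero means "fit", zero means "hold fixed"). */
int count_fitted(const int ia[], int ma)
{
    int j, mfit = 0;

    for (j = 1; j <= ma; j++)
        if (ia[j]) mfit++;
    return mfit;
}
```

For a quadratic with the linear coefficient frozen, one would set ia[1] = 1, ia[2] = 0, ia[3] = 1 and place the frozen value in a[2] before the fit; count_fitted then returns 2.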


#include "nrutil.h"

void lfit(float x[], float y[], float sig[], int ndat, float a[], int ia[],
    int ma, float **covar, float *chisq, void (*funcs)(float, float [], int))
/* Given a set of data points x[1..ndat], y[1..ndat] with individual standard deviations
sig[1..ndat], use chi-square minimization to fit for some or all of the coefficients a[1..ma] of
a function that depends linearly on a, y = sum_i a_i * afunc_i(x). The input array ia[1..ma]
indicates by nonzero entries those components of a that should be fitted for, and by zero entries
those components that should be held fixed at their input values. The program returns values
for a[1..ma], chi-square = chisq, and the covariance matrix covar[1..ma][1..ma]. (Parameters
held fixed will return zero covariances.) The user supplies a routine funcs(x,afunc,ma) that
returns the ma basis functions evaluated at x = x in the array afunc[1..ma]. */
{
    void covsrt(float **covar, int ma, int ia[], int mfit);
    void gaussj(float **a, int n, float **b, int m);
    int i,j,k,l,m,mfit=0;
    float ym,wt,sum,sig2i,**beta,*afunc;

    beta=matrix(1,ma,1,1);
    afunc=vector(1,ma);
    for (j=1;j<=ma;j++)
        if (ia[j]) mfit++;
    if (mfit == 0) nrerror("lfit: no parameters to be fitted");
    for (j=1;j<=mfit;j++) {                 /* Initialize the (symmetric) matrix. */
        for (k=1;k<=mfit;k++) covar[j][k]=0.0;
        beta[j][1]=0.0;
    }
    for (i=1;i<=ndat;i++) {                 /* Loop over data to accumulate coefficients */
        (*funcs)(x[i],afunc,ma);            /* of the normal equations. */
        ym=y[i];
        if (mfit < ma) {                    /* Subtract off dependences on known pieces */
            for (j=1;j<=ma;j++)             /* of the fitting function. */
                if (!ia[j]) ym -= a[j]*afunc[j];
        }
        sig2i=1.0/SQR(sig[i]);
        for (j=0,l=1;l<=ma;l++) {
            if (ia[l]) {
                wt=afunc[l]*sig2i;
                for (j++,k=0,m=1;m<=l;m++)
                    if (ia[m]) covar[j][++k] += wt*afunc[m];
                beta[j][1] += ym*wt;
            }
        }
    }
    for (j=2;j<=mfit;j++)                   /* Fill in above the diagonal from symmetry. */
        for (k=1;k<j;k++)
            covar[k][j]=covar[j][k];
    gaussj(covar,mfit,beta,1);              /* Matrix solution. */
    for (j=0,l=1;l<=ma;l++)
        if (ia[l]) a[l]=beta[++j][1];       /* Partition solution to appropriate coefficients a. */
    *chisq=0.0;
    for (i=1;i<=ndat;i++) {                 /* Evaluate chi-square of the fit. */
        (*funcs)(x[i],afunc,ma);
        for (sum=0.0,j=1;j<=ma;j++) sum += a[j]*afunc[j];
        *chisq += SQR((y[i]-sum)/sig[i]);
    }
    covsrt(covar,ma,ia,mfit);               /* Sort covariance matrix to true order of */
    free_vector(afunc,1,ma);                /* fitting coefficients. */
    free_matrix(beta,1,ma,1,1);
}


Fill in above the diagonal from symmetry. 

Matrix solution. 

Partition solution to appropriate coefficients 
a. 

Evaluate x 2 of the fit. 


That last call to a function covsrt is only for the purpose of spreading the 
covariances back into the full ma × ma covariance matrix, in the proper rows and
columns and with zero variances and covariances set for variables which were 
held frozen. 

The function covsrt is as follows. 


#define SWAP(a,b) {swap=(a);(a)=(b);(b)=swap;}

void covsrt(float **covar, int ma, int ia[], int mfit)
/* Expand in storage the covariance matrix covar, so as to take into account parameters that are
being held fixed. (For the latter, return zero covariances.) */
{
    int i,j,k;
    float swap;

    for (i=mfit+1;i<=ma;i++)
        for (j=1;j<=i;j++) covar[i][j]=covar[j][i]=0.0;
    k=mfit;
    for (j=ma;j>=1;j--) {
        if (ia[j]) {
            for (i=1;i<=ma;i++) SWAP(covar[i][k],covar[i][j])
            for (i=1;i<=ma;i++) SWAP(covar[k][i],covar[j][i])
            k--;
        }
    }
}


Solution by Use of Singular Value Decomposition

In some applications, the normal equations are perfectly adequate for linear 
least-squares problems. However, in many cases the normal equations are very close 
to singular. A zero pivot element may be encountered during the solution of the 
linear equations (e.g., in gaussj), in which case you get no solution at all. Or a 
very small pivot may occur, in which case you typically get fitted parameters $a_k$
with very large magnitudes that are delicately (and unstably) balanced to cancel out 
almost precisely when the fitted function is evaluated. 

Why does this commonly occur? The reason is that, more often than experimenters
would like to admit, data do not clearly distinguish between two or more of
the basis functions provided. If two such functions, or two different combinations
of functions, happen to fit the data about equally well (or equally badly), then
the matrix $[\alpha]$, unable to distinguish between them, neatly folds up its tent and
becomes singular. There is a certain mathematical irony in the fact that least-squares 
problems are both overdetermined (number of data points greater than number of 
parameters) and underdetermined (ambiguous combinations of parameters exist); 
but that is how it frequently is. The ambiguities can be extremely hard to notice 
a priori in complicated problems. 
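The onset of this degeneracy is easy to see numerically. The following is a minimal sketch under stated assumptions, not a routine from this book: the function normal_eq_degeneracy and the three example basis functions are hypothetical names, unit measurement errors are assumed, and only two basis functions are considered, so that the normal matrix is 2 × 2 and its determinant can be written down directly.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical helper: for two basis functions f1, f2 sampled at
   x[0..n-1] with unit errors, accumulate the 2x2 normal matrix and
   return det / (alpha11 * alpha22). The result lies between 0 and 1
   and tends to 0 as the two functions become indistinguishable on
   the data. */
double normal_eq_degeneracy(const double x[], int n,
                            double (*f1)(double), double (*f2)(double))
{
    double a11 = 0.0, a12 = 0.0, a22 = 0.0;
    int i;
    assert(n > 0);
    for (i = 0; i < n; i++) {
        double u = f1(x[i]), v = f2(x[i]);
        a11 += u * u;
        a12 += u * v;
        a22 += v * v;
    }
    assert(a11 > 0.0 && a22 > 0.0);
    return (a11 * a22 - a12 * a12) / (a11 * a22);
}

/* Example basis functions (hypothetical). */
double basis_one(double x)      { (void)x; return 1.0; }
double basis_x(double x)        { return x; }
double basis_almost_x(double x) { return x + 1.0e-4 * x * x; }
```

On five equally spaced points in [0,1], the pair basis_one, basis_x gives a ratio of 1/3, while the pair basis_x, basis_almost_x gives a ratio far below single-precision roundoff, so a float implementation of the normal equations would see that matrix as effectively singular.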

Enter singular value decomposition (SVD). This would be a good time for you 
to review the material in §2.6, which we will not repeat here. In the case of an 
overdetermined system, SVD produces a solution that is the best approximation in 
the least-squares sense, cf. equation (2.6.10). That is exactly what we want. In the 
case of an underdetermined system, SVD produces a solution whose values (for us, 
the $a_k$’s) are smallest in the least-squares sense, cf. equation (2.6.8). That is also
what we want: When some combination of basis functions is irrelevant to the fit, that 
combination will be driven down to a small, innocuous, value, rather than pushed 
up to delicately canceling infinities. 

In terms of the design matrix A (equation 15.4.4) and the vector b (equation
15.4.5), minimization of $\chi^2$ in (15.4.3) can be written as

$$\text{find } \mathbf{a} \text{ that minimizes } \chi^2 = |\mathbf{A}\cdot\mathbf{a} - \mathbf{b}|^2 \tag{15.4.16}$$

Comparing to equation (2.6.9), we see that this is precisely the problem that routines
svdcmp and svbksb are designed to solve. The solution, which is given by equation
(2.6.12), can be rewritten as follows: If U and V enter the SVD decomposition
of A according to equation (2.6.1), as computed by svdcmp, then let the vectors
$\mathbf{U}_{(i)}$, $i = 1, \ldots, M$ denote the columns of U (each one a vector of length N); and
let the vectors $\mathbf{V}_{(i)}$, $i = 1, \ldots, M$ denote the columns of V (each one a vector
of length M). Then the solution (2.6.12) of the least-squares problem (15.4.16)
can be written as

$$\mathbf{a} = \sum_{i=1}^{M} \left( \frac{\mathbf{U}_{(i)}\cdot\mathbf{b}}{w_i} \right) \mathbf{V}_{(i)} \tag{15.4.17}$$

where the $w_i$ are, as in §2.6, the singular values calculated by svdcmp.

Equation (15.4.17) says that the fitted parameters a are linear combinations of
the columns of V, with coefficients obtained by forming dot products of the columns
of U with the weighted data vector (15.4.5). Though it is beyond our scope to prove
here, it turns out that the standard (loosely, “probable”) errors in the fitted parameters
are also linear combinations of the columns of V. In fact, equation (15.4.17) can
be written in a form displaying these errors as

$$\mathbf{a} = \left[ \sum_{i=1}^{M} \left( \frac{\mathbf{U}_{(i)}\cdot\mathbf{b}}{w_i} \right) \mathbf{V}_{(i)} \right] \pm \frac{1}{w_1}\mathbf{V}_{(1)} \pm \cdots \pm \frac{1}{w_M}\mathbf{V}_{(M)} \tag{15.4.18}$$

Here each ± is followed by a standard deviation. The amazing fact is that,
decomposed in this fashion, the standard deviations are all mutually independent
(uncorrelated). Therefore they can be added together in root-mean-square fashion.
What is going on is that the vectors $\mathbf{V}_{(i)}$ are the principal axes of the error ellipsoid
of the fitted parameters a (see §15.6).

It follows that the variance in the estimate of a parameter $a_j$ is given by

$$\sigma^2(a_j) = \sum_{i=1}^{M} \frac{1}{w_i^2}\,[\mathbf{V}_{(i)}]_j^2 = \sum_{i=1}^{M} \left( \frac{V_{ji}}{w_i} \right)^2 \tag{15.4.19}$$

whose result should be identical with (15.4.14). As before, you should not be
surprised at the formula for the covariances, here given without proof,

$$\mathrm{Cov}(a_j, a_k) = \sum_{i=1}^{M} \left( \frac{V_{ji} V_{ki}}{w_i^2} \right) \tag{15.4.20}$$

We introduced this subsection by noting that the normal equations can fail 
by encountering a zero pivot. We have not yet, however, mentioned how SVD 
overcomes this problem. The answer is: If any singular value $w_i$ is zero, its
reciprocal in equation (15.4.18) should be set to zero, not infinity. (Compare the 
discussion preceding equation 2.6.7.) This corresponds to adding to the fitted 
parameters a a zero multiple, rather than some random large multiple, of any linear 
combination of basis functions that are degenerate in the fit. It is a good thing to do! 

Moreover, if a singular value $w_i$ is nonzero but very small, you should also
define its reciprocal to be zero, since its apparent value is probably an artifact of
roundoff error, not a meaningful number. A plausible answer to the question “how
small is small?” is to edit in this fashion all singular values whose ratio to the
largest singular value is less than N times the machine precision $\epsilon$. (You might
argue for $\sqrt{N}$, or a constant, instead of N as the multiple; that starts getting into
hardware-dependent questions.)
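As a concrete illustration of this editing rule, here is a minimal sketch. It is not one of this book's routines: the name edit_singular_values is hypothetical, the array is 0-based, and FLT_EPSILON from the standard header float.h is taken as the machine precision for single-precision arithmetic.

```c
#include <assert.h>
#include <float.h>

/* Hypothetical helper: set to zero every singular value in w[0..m-1]
   whose ratio to the largest one is below ndata * FLT_EPSILON.
   Returns the number of singular values that fell below the threshold. */
int edit_singular_values(float w[], int m, int ndata)
{
    float wmax = 0.0f, thresh;
    int j, nedited = 0;
    assert(m > 0 && ndata > 0);
    for (j = 0; j < m; j++)
        if (w[j] > wmax) wmax = w[j];
    thresh = (float)ndata * FLT_EPSILON * wmax;
    for (j = 0; j < m; j++) {
        if (w[j] < thresh) {
            w[j] = 0.0f;
            nedited++;
        }
    }
    return nedited;
}
```

A nonzero return value is worth reporting to the user, since it signals that some combination of basis functions was not determined by the data.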

There is another reason for editing even additional singular values, ones large 
enough that roundoff error is not a question. Singular value decomposition allows 
you to identify linear combinations of variables that just happen not to contribute 
much to reducing the $\chi^2$ of your data set. Editing these can sometimes reduce the
probable error on your coefficients quite significantly, while increasing the minimum
$\chi^2$ only negligibly. We will learn more about identifying and treating such cases
in §15.6. In the following routine, the point at which this kind of editing would 
occur is indicated. 

Generally speaking, we recommend that you always use SVD techniques instead 
of using the normal equations. SVD’s only significant disadvantage is that it requires
an extra array of size N × M to store the whole design matrix. This storage
is overwritten by the matrix U. Storage is also required for the M × M matrix
V, but this is instead of the same-sized matrix for the coefficients of the normal
equations. SVD can be significantly slower than solving the normal equations; 
however, its great advantage, that it (theoretically) cannot fail, more than makes up 
for the speed disadvantage. 

In the routine that follows, the matrices u,v and the vector w are input as 
working space. The logical dimensions of the problem are ndata data points by ma 
basis functions (and fitted parameters). If you care only about the values a of the 
fitted parameters, then u, v, w contain no useful information on output. If you want 
probable errors for the fitted parameters, read on. 

#include "nrutil.h"
#define TOL 1.0e-5      /* Default value for single precision and variables
                           scaled to order unity. */

void svdfit(float x[], float y[], float sig[], int ndata, float a[], int ma,
    float **u, float **v, float w[], float *chisq,
    void (*funcs)(float, float [], int))
/* Given a set of data points x[1..ndata], y[1..ndata] with individual standard deviations
sig[1..ndata], use chi-square minimization to determine the coefficients a[1..ma] of the fitting
function y = sum_i a_i * afunc_i(x). Here we solve the fitting equations using singular
value decomposition of the ndata by ma matrix, as in §2.6. Arrays u[1..ndata][1..ma],
v[1..ma][1..ma], and w[1..ma] provide workspace on input; on output they define the
singular value decomposition, and can be used to obtain the covariance matrix. The program
returns values for the ma fit parameters a, and chi-square, chisq. The user supplies a routine
funcs(x,afunc,ma) that returns the ma basis functions evaluated at x in the array
afunc[1..ma]. */
{
    void svbksb(float **u, float w[], float **v, int m, int n, float b[],
        float x[]);
    void svdcmp(float **a, int m, int n, float w[], float **v);
    int j,i;
    float wmax,tmp,thresh,sum,*b,*afunc;

    b=vector(1,ndata);
    afunc=vector(1,ma);
    for (i=1;i<=ndata;i++) {            /* Accumulate coefficients of the fitting matrix. */
        (*funcs)(x[i],afunc,ma);
        tmp=1.0/sig[i];
        for (j=1;j<=ma;j++) u[i][j]=afunc[j]*tmp;
        b[i]=y[i]*tmp;
    }
    svdcmp(u,ndata,ma,w,v);             /* Singular value decomposition. */
    wmax=0.0;                           /* Edit the singular values, given TOL from the
                                           #define statement, between here ... */
    for (j=1;j<=ma;j++)
        if (w[j] > wmax) wmax=w[j];
    thresh=TOL*wmax;
    for (j=1;j<=ma;j++)
        if (w[j] < thresh) w[j]=0.0;    /* ...and here. */
    svbksb(u,w,v,ndata,ma,b,a);
    *chisq=0.0;                         /* Evaluate chi-square. */
    for (i=1;i<=ndata;i++) {
        (*funcs)(x[i],afunc,ma);
        for (sum=0.0,j=1;j<=ma;j++) sum += a[j]*afunc[j];
        *chisq += (tmp=(y[i]-sum)/sig[i],tmp*tmp);
    }
    free_vector(afunc,1,ma);
    free_vector(b,1,ndata);
}

Feeding the matrix v and vector w output by the above program into the
following short routine, you easily obtain variances and covariances of the fitted 
parameters a. The square roots of the variances are the standard deviations of 
the fitted parameters. The routine straightforwardly implements equation (15.4.20) 
above, with the convention that singular values equal to zero are recognized as 
having been edited out of the fit. 

#include "nrutil.h"

void svdvar(float **v, int ma, float w[], float **cvm)
/* To evaluate the covariance matrix cvm[1..ma][1..ma] of the fit for ma parameters obtained
by svdfit, call this routine with matrices v[1..ma][1..ma], w[1..ma] as returned from
svdfit. */
{
    int k,j,i;
    float sum,*wti;

    wti=vector(1,ma);
    for (i=1;i<=ma;i++) {
        wti[i]=0.0;
        if (w[i]) wti[i]=1.0/(w[i]*w[i]);
    }
    for (i=1;i<=ma;i++) {               /* Sum contributions to covariance matrix (15.4.20). */
        for (j=1;j<=i;j++) {
            for (sum=0.0,k=1;k<=ma;k++) sum += v[i][k]*v[j][k]*wti[k];
            cvm[j][i]=cvm[i][j]=sum;
        }
    }
    free_vector(wti,1,ma);
}


Examples 

Be aware that some apparently nonlinear problems can be expressed so that 
they are linear. For example, an exponential model with two parameters a and b, 

$$y(x) = a \exp(-bx) \tag{15.4.21}$$

can be rewritten, with $c \equiv \log a$, as

$$\log[y(x)] = c - bx \tag{15.4.22}$$

which is linear in its parameters c and b. (Of course you must be aware that such 
transformations do not exactly take Gaussian errors into Gaussian errors.) 

Also watch out for “non-parameters,” as in 

$$y(x) = a \exp(-bx + d) \tag{15.4.23}$$

Here the parameters a and d are, in fact, indistinguishable. This is a good example of 
where the normal equations will be exactly singular, and where SVD will find a zero 
singular value. SVD will then make a “least-squares” choice for setting a balance 
between a and d (or, rather, their equivalents in the linear model derived by taking the 
logarithms). However — and this is true whenever SVD gives back a zero singular 
value — you are better advised to figure out analytically where the degeneracy is 
among your basis functions, and then make appropriate deletions in the basis set. 

Here are two examples for user-supplied routines funcs. The first one is trivial
and fits a general polynomial to a set of data:

void fpoly(float x, float p[], int np)
/* Fitting routine for a polynomial of degree np-1, with coefficients in the array p[1..np]. */
{
    int j;

    p[1]=1.0;
    for (j=2;j<=np;j++) p[j]=p[j-1]*x;
}


The second example is slightly less trivial. It is used to fit Legendre polynomials 
up to some order nl-1 through a data set. 

void fleg(float x, float pl[], int nl)
/* Fitting routine for an expansion with nl Legendre polynomials pl, evaluated using the recurrence
relation as in §5.5. */
{
    int j;
    float twox,f2,f1,d;

    pl[1]=1.0;
    pl[2]=x;
    if (nl > 2) {
        twox=2.0*x;
        f2=x;
        d=1.0;
        for (j=3;j<=nl;j++) {
            f1=d++;
            f2 += twox;
            pl[j]=(f2*pl[j-1]-f1*pl[j-2])/d;
        }
    }
}


Multidimensional Fits 

If you are measuring a single variable y as a function of more than one variable
(say, a vector of variables x), then your basis functions will be functions of a vector,
$X_1(\mathbf{x}), \ldots, X_M(\mathbf{x})$. The $\chi^2$ merit function is now

$$\chi^2 = \sum_{i=1}^{N} \left[ \frac{y_i - \sum_{k=1}^{M} a_k X_k(\mathbf{x}_i)}{\sigma_i} \right]^2 \tag{15.4.24}$$

All of the preceding discussion goes through unchanged, with x replaced by $\mathbf{x}$. In
fact, if you are willing to tolerate a bit of programming hack, you can use the above
programs without any modification: In both lfit and svdfit, the only use made
of the array elements x[i] is that each element is in turn passed to the user-supplied
routine funcs, which duly gives back the values of the basis functions at that point.
If you set x[i]=i before calling lfit or svdfit, and independently provide funcs
with the true vector values of your data points (e.g., in global variables), then funcs
can translate from the fictitious x[i]’s to the actual data points before doing its work.
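The index trick just described might look as follows in practice. This is only a sketch under stated assumptions: the names fplane, xvec, NPTS, and NDIM are hypothetical, and the model is taken to be a plane, y = a1 + a2*u + a3*v, in two independent variables u and v. The routine has the same signature as the funcs expected by lfit and svdfit and fills afunc[1..ma] in the 1-based convention used throughout.

```c
#include <assert.h>

#define NPTS 3      /* hypothetical number of data points */
#define NDIM 2      /* hypothetical number of independent variables */

/* True (vector) data points, stored where the basis-function routine
   can see them. Row 0 is unused so that row i corresponds to the
   fictitious abscissa x[i] = i. */
float xvec[NPTS + 1][NDIM];

/* Hypothetical basis set 1, u, v for the plane y = a1 + a2*u + a3*v.
   The scalar argument x is not a coordinate; it is the index of the
   data point, passed through the fitting routine untouched. */
void fplane(float x, float afunc[], int ma)
{
    int i = (int)(x + 0.5f);        /* recover the integer index */
    assert(ma == 3 && i >= 1 && i <= NPTS);
    afunc[1] = 1.0f;
    afunc[2] = xvec[i][0];
    afunc[3] = xvec[i][1];
}
```

Before the fit one would fill xvec with the measured points and set x[i]=i for i = 1, ..., NPTS; the fitting routine itself never needs to know that the abscissas are fictitious.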



CITED REFERENCES AND FURTHER READING: 

Bevington, P.R. 1969, Data Reduction and Error Analysis for the Physical Sciences (New York:
McGraw-Hill), Chapters 8-9. 


imple page from NUMERICAL RECIPES IN C: THE ART OF SCIENTIFIC COMPUTING (ISBN 0-521 -43108-5) 





Lawson, C.L., and Hanson, R. 1974, Solving Least Squares Problems (Englewood Cliffs, NJ: 
Prentice-Hall). 

Forsythe, G.E., Malcolm, M.A., and Moler, C.B. 1977, Computer Methods for Mathematical 
Computations (Englewood Cliffs, NJ: Prentice-Hall), Chapter 9. 


15.5 Nonlinear Models 

We now consider fitting when the model depends nonlinearly on the set of M
unknown parameters $a_k$, $k = 1, 2, \ldots, M$. We use the same approach as in previous
sections, namely to define a $\chi^2$ merit function and determine best-fit parameters
by its minimization. With nonlinear dependences, however, the minimization must
proceed iteratively. Given trial values for the parameters, we develop a procedure
that improves the trial solution. The procedure is then repeated until $\chi^2$ stops (or
effectively stops) decreasing.

How is this problem different from the general nonlinear function minimization
problem already dealt with in Chapter 10? Superficially, not at all: Sufficiently
close to the minimum, we expect the $\chi^2$ function to be well approximated by a
quadratic form, which we can write as

$$\chi^2(\mathbf{a}) \approx \gamma - \mathbf{d}\cdot\mathbf{a} + \tfrac{1}{2}\,\mathbf{a}\cdot\mathbf{D}\cdot\mathbf{a} \tag{15.5.1}$$

where d is an M-vector and D is an M × M matrix. (Compare equation 10.6.1.)
If the approximation is a good one, we know how to jump from the current trial
parameters $\mathbf{a}_{\rm cur}$ to the minimizing ones $\mathbf{a}_{\min}$ in a single leap, namely

$$\mathbf{a}_{\min} = \mathbf{a}_{\rm cur} + \mathbf{D}^{-1}\cdot\left[-\nabla\chi^2(\mathbf{a}_{\rm cur})\right] \tag{15.5.2}$$

(Compare equation 10.7.4.)

On the other hand, (15.5.1) might be a poor local approximation to the shape
of the function that we are trying to minimize at $\mathbf{a}_{\rm cur}$. In that case, about all we
can do is take a step down the gradient, as in the steepest descent method (§10.6).
In other words,

$$\mathbf{a}_{\rm next} = \mathbf{a}_{\rm cur} - \text{constant} \times \nabla\chi^2(\mathbf{a}_{\rm cur}) \tag{15.5.3}$$

where the constant is small enough not to exhaust the downhill direction.

To use (15.5.2) or (15.5.3), we must be able to compute the gradient of the $\chi^2$
function at any set of parameters a. To use (15.5.2) we also need the matrix D, which
is the second derivative matrix (Hessian matrix) of the $\chi^2$ merit function, at any a.

Now, this is the crucial difference from Chapter 10: There, we had no way of 
directly evaluating the Hessian matrix. We were given only the ability to evaluate 
the function to be minimized and (in some cases) its gradient. Therefore, we had 
to resort to iterative methods not just because our function was nonlinear, but also 
in order to build up information about the Hessian matrix. Sections 10.7 and 10.6 
concerned themselves with two different techniques for building up this information.

Here, life is much simpler. We know exactly the form of $\chi^2$, since it is based
on a model function that we ourselves have specified. Therefore the Hessian matrix 
is known to us. Thus we are free to use (15.5.2) whenever we care to do so. The 
only reason to use (15.5.3) will be failure of (15.5.2) to improve the fit, signaling 
failure of (15.5.1) as a good local approximation. 

Calculation of the Gradient and Hessian 

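The model to be fitted is

$$y = y(x; \mathbf{a}) \tag{15.5.4}$$

and the $\chi^2$ merit function is

$$\chi^2(\mathbf{a}) = \sum_{i=1}^{N} \left[ \frac{y_i - y(x_i; \mathbf{a})}{\sigma_i} \right]^2 \tag{15.5.5}$$

The gradient of $\chi^2$ with respect to the parameters a, which will be zero at the $\chi^2$
minimum, has components

$$\frac{\partial\chi^2}{\partial a_k} = -2 \sum_{i=1}^{N} \frac{[y_i - y(x_i; \mathbf{a})]}{\sigma_i^2}\, \frac{\partial y(x_i; \mathbf{a})}{\partial a_k} \qquad k = 1, 2, \ldots, M \tag{15.5.6}$$

Taking an additional partial derivative gives

$$\frac{\partial^2\chi^2}{\partial a_k\,\partial a_l} = 2 \sum_{i=1}^{N} \frac{1}{\sigma_i^2} \left[ \frac{\partial y(x_i; \mathbf{a})}{\partial a_k}\, \frac{\partial y(x_i; \mathbf{a})}{\partial a_l} - [y_i - y(x_i; \mathbf{a})]\, \frac{\partial^2 y(x_i; \mathbf{a})}{\partial a_l\,\partial a_k} \right] \tag{15.5.7}$$

It is conventional to remove the factors of 2 by defining

$$\beta_k \equiv -\frac{1}{2}\frac{\partial\chi^2}{\partial a_k} \qquad \alpha_{kl} \equiv \frac{1}{2}\frac{\partial^2\chi^2}{\partial a_k\,\partial a_l} \tag{15.5.8}$$

making $[\alpha] = \frac{1}{2}\mathbf{D}$ in equation (15.5.2), in terms of which that equation can be
rewritten as the set of linear equations

$$\sum_{l=1}^{M} \alpha_{kl}\,\delta a_l = \beta_k \tag{15.5.9}$$

This set is solved for the increments $\delta a_l$ that, added to the current approximation,
give the next approximation. In the context of least-squares, the matrix $[\alpha]$, equal to
one-half times the Hessian matrix, is usually called the curvature matrix.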

Equation (15.5.3), the steepest descent formula, translates to

$$\delta a_l = \text{constant} \times \beta_l \tag{15.5.10}$$

Note that the components $\alpha_{kl}$ of the Hessian matrix (15.5.7) depend both on the
first derivatives and on the second derivatives of the basis functions with respect to
their parameters. Some treatments proceed to ignore the second derivative without
comment. We will ignore it also, but only after a few comments.

Second derivatives occur because the gradient (15.5.6) already has a dependence
on $\partial y/\partial a_k$, so the next derivative simply must contain terms involving $\partial^2 y/\partial a_l\,\partial a_k$.
The second derivative term can be dismissed when it is zero (as in the linear case
of equation 15.4.8), or small enough to be negligible when compared to the term
involving the first derivative. It also has an additional possibility of being ignorably
small in practice: The term multiplying the second derivative in equation (15.5.7)
is $[y_i - y(x_i; \mathbf{a})]$. For a successful model, this term should just be the random
measurement error of each point. This error can have either sign, and should in
general be uncorrelated with the model. Therefore, the second derivative terms tend
to cancel out when summed over i.

Inclusion of the second-derivative term can in fact be destabilizing if the model 
fits badly or is contaminated by outlier points that are unlikely to be offset by 
compensating points of opposite sign. From this point on, we will always use as
the definition of $\alpha_{kl}$ the formula

$$\alpha_{kl} = \sum_{i=1}^{N} \frac{1}{\sigma_i^2} \left[ \frac{\partial y(x_i; \mathbf{a})}{\partial a_k}\, \frac{\partial y(x_i; \mathbf{a})}{\partial a_l} \right] \tag{15.5.11}$$

This expression more closely resembles its linear cousin (15.4.8). You should
understand that minor (or even major) fiddling with $[\alpha]$ has no effect at all on
what final set of parameters a is reached, but affects only the iterative route that is
taken in getting there. The condition at the $\chi^2$ minimum, that $\beta_k = 0$ for all k,
is independent of how $[\alpha]$ is defined.

Levenberg-Marquardt Method 

Marquardt [1] has put forth an elegant method, related to an earlier suggestion of
Levenberg, for varying smoothly between the extremes of the inverse-Hessian method 
(15.5.9) and the steepest descent method (15.5.10). The latter method is used far from 
the minimum, switching continuously to the former as the minimum is approached. 
This Levenberg-Marquardt method (also called Marquardt method) works very well 
in practice and has become the standard of nonlinear least-squares routines. 

The method is based on two elementary, but important, insights. Consider the 
“constant” in equation (15.5.10). What should it be, even in order of magnitude? 
What sets its scale? There is no information about the answer in the gradient. 
That tells only the slope, not how far that slope extends. Marquardt’s first insight 
is that the components of the Hessian matrix, even if they are not usable in any 
precise fashion, give some information about the order-of-magnitude scale of the 
problem. 

The quantity $\chi^2$ is nondimensional, i.e., is a pure number; this is evident from
its definition (15.5.5). On the other hand, $\beta_k$ has the dimensions of $1/a_k$, which
may well be dimensional, i.e., have units like cm$^{-1}$, or kilowatt-hours, or whatever.
(In fact, each component of $\beta_k$ can have different dimensions!) The constant of
proportionality between $\beta_k$ and $\delta a_k$ must therefore have the dimensions of $a_k^2$. Scan
the components of $[\alpha]$ and you see that there is only one obvious quantity with these
dimensions, and that is $1/\alpha_{kk}$, the reciprocal of the diagonal element. So that must
set the scale of the constant. But that scale might itself be too big. So let’s divide
the constant by some (nondimensional) fudge factor $\lambda$, with the possibility of setting
$\lambda \gg 1$ to cut down the step. In other words, replace equation (15.5.10) by

$$\delta a_l = \frac{1}{\lambda\,\alpha_{ll}}\,\beta_l \qquad \text{or} \qquad \lambda\,\alpha_{ll}\,\delta a_l = \beta_l \tag{15.5.12}$$

It is necessary that $\alpha_{ll}$ be positive, but this is guaranteed by definition (15.5.11),
another reason for adopting that equation.

Marquardt’s second insight is that equations (15.5.12) and (15.5.9) can be
combined if we define a new matrix $\alpha'$ by the following prescription

$$\alpha'_{jj} \equiv \alpha_{jj}(1 + \lambda) \qquad \alpha'_{jk} \equiv \alpha_{jk} \quad (j \ne k) \tag{15.5.13}$$

and then replace both (15.5.12) and (15.5.9) by

$$\sum_{l=1}^{M} \alpha'_{kl}\,\delta a_l = \beta_k \tag{15.5.14}$$

When $\lambda$ is very large, the matrix $\alpha'$ is forced into being diagonally dominant, so
equation (15.5.14) goes over to be identical to (15.5.12). On the other hand, as $\lambda$
approaches zero, equation (15.5.14) goes over to (15.5.9).
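The prescription (15.5.13) takes only a few lines of code. The sketch below is illustrative only: the name augment_diagonal is hypothetical, and the matrices are stored row by row in flat 0-based arrays rather than in the 1-based float ** form used by the routines in this chapter.

```c
#include <assert.h>

/* Hypothetical helper: build alpha' from alpha according to (15.5.13).
   Both are m x m matrices stored row by row in flat arrays of length
   m*m. Diagonal elements are multiplied by (1 + lambda); off-diagonal
   elements are copied unchanged. */
void augment_diagonal(const float alpha[], int m, float lambda, float aprime[])
{
    int j, k;
    assert(m > 0 && lambda >= 0.0f);
    for (j = 0; j < m; j++) {
        for (k = 0; k < m; k++) {
            if (j == k)
                aprime[j * m + k] = alpha[j * m + k] * (1.0f + lambda);
            else
                aprime[j * m + k] = alpha[j * m + k];
        }
    }
}
```

With lambda = 0 the matrix is returned unchanged, which is the inverse-Hessian limit (15.5.9); as lambda grows the diagonal swamps the off-diagonal terms and each parameter is stepped independently, as in (15.5.12).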

Given an initial guess for the set of fitted parameters a, the recommended
Marquardt recipe is as follows:

• Compute $\chi^2(\mathbf{a})$.

• Pick a modest value for $\lambda$, say $\lambda = 0.001$.

• (†) Solve the linear equations (15.5.14) for $\delta\mathbf{a}$ and evaluate $\chi^2(\mathbf{a} + \delta\mathbf{a})$.

• If $\chi^2(\mathbf{a} + \delta\mathbf{a}) \ge \chi^2(\mathbf{a})$, increase $\lambda$ by a factor of 10 (or any other
substantial factor) and go back to (†).

• If $\chi^2(\mathbf{a} + \delta\mathbf{a}) < \chi^2(\mathbf{a})$, decrease $\lambda$ by a factor of 10, update the trial
solution $\mathbf{a} \leftarrow \mathbf{a} + \delta\mathbf{a}$, and go back to (†).

Also necessary is a condition for stopping. Iterating to convergence (to machine 
accuracy or to the roundoff limit) is generally wasteful and unnecessary since the 
minimum is at best only a statistical estimate of the parameters a. As we will see 
in §15.6, a change in the parameters that changes $\chi^2$ by an amount $\ll 1$ is never
statistically meaningful. 

Furthermore, it is not uncommon to find the parameters wandering
around near the minimum in a flat valley of complicated topography. The reason
is that Marquardt’s method generalizes the method of normal equations (§15.4),
hence has the same problem as that method with regard to near-degeneracy of the
minimum. Outright failure by a zero pivot is possible, but unlikely. More often,
a small pivot will generate a large correction which is then rejected, the value of
$\lambda$ being then increased. For sufficiently large $\lambda$ the matrix $[\alpha']$ is positive definite
and can have no small pivots. Thus the method does tend to stay away from zero
pivots, but at the cost of a tendency to wander around doing steepest descent in
very un-steep degenerate valleys.

These considerations suggest that, in practice, one might as well stop iterating
on the first or second occasion that $\chi^2$ decreases by a negligible amount, say either
less than 0.01 absolutely or (in case roundoff prevents that being reached) some
fractional amount like $10^{-3}$. Don’t stop after a step where $\chi^2$ increases: That only
shows that $\lambda$ has not yet adjusted itself optimally.
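One way to encode this stopping rule in a driver loop is sketched below. It is an illustration only: the name lm_converged is hypothetical, the thresholds 0.01 and 1e-3 are simply the values suggested above, and we assume that a rejected step is reported with an unchanged (or larger) value of $\chi^2$.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper for a driver loop around a Marquardt step.
   Call it once per iteration with chi-square before (ochisq) and after
   (chisq) the step. *nsmall counts the occasions on which chi-square
   decreased by a negligible amount; the caller must set it to 0 before
   the first iteration. Returns 1 when iteration may stop, 0 otherwise. */
int lm_converged(float ochisq, float chisq, int *nsmall)
{
    float decrease = ochisq - chisq;
    assert(nsmall != NULL);
    if (decrease <= 0.0f)
        return 0;               /* rejected step: never stop here */
    if (decrease < 0.01f || decrease < 1.0e-3f * ochisq)
        (*nsmall)++;            /* one more negligible decrease */
    return *nsmall >= 2;
}
```

Stopping on the second negligible decrease rather than the first guards against a single small step taken while $\lambda$ is still settling down.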

Once the acceptable minimum has been found, one wants to set $\lambda = 0$ and
compute the matrix

$$[C] \equiv [\alpha]^{-1} \tag{15.5.15}$$

which, as before, is the estimated covariance matrix of the standard errors in the
fitted parameters a (see next section).

The following pair of functions encodes Marquardt’s method for nonlinear
parameter estimation. Much of the organization matches that used in lfit of §15.4.
In particular the array ia[1..ma] must be input with components one or zero
corresponding to whether the respective parameter values a[1..ma] are to be fitted
for or held fixed at their input values, respectively.

The routine mrqmin performs one iteration of Marquardt’s method. It is first
called (once) with alamda < 0, which signals the routine to initialize. alamda is set
on the first and all subsequent calls to the suggested value of $\lambda$ for the next iteration;
a and chisq are always given back as the best parameters found so far and their
$\chi^2$. When convergence is deemed satisfactory, set alamda to zero before a final call.
The matrices alpha and covar (which were used as workspace in all previous calls) 
will then be set to the curvature and covariance matrices for the converged parameter 
values. The arguments alpha, a, and chisq must not be modified between calls, 
nor should alamda be, except to set it to zero for the final call. When an uphill 
step is taken, chisq and a are given back with their input (best) values, but alamda 
is set to an increased value. 

The routine mrqmin calls the routine mrqcof for the computation of the matrix
$[\alpha]$ (equation 15.5.11) and vector $\beta$ (equations 15.5.6 and 15.5.8). In turn mrqcof
calls the user-supplied routine funcs(x,a,y,dyda), which for input values x ≡ $x_i$
and a ≡ $\mathbf{a}$ calculates the model function y ≡ $y(x_i; \mathbf{a})$ and the vector of derivatives
dyda ≡ $\partial y/\partial a_k$.

#include "nrutil.h"

void mrqmin(float x[], float y[], float sig[], int ndata, float a[], int ia[],
    int ma, float **covar, float **alpha, float *chisq,
    void (*funcs)(float, float [], float *, float [], int), float *alamda)
/* Levenberg-Marquardt method, attempting to reduce the value chi-square of a fit between a set
of data points x[1..ndata], y[1..ndata] with individual standard deviations sig[1..ndata],
and a nonlinear function dependent on ma coefficients a[1..ma]. The input array ia[1..ma]
indicates by nonzero entries those components of a that should be fitted for, and by zero
entries those components that should be held fixed at their input values. The program
returns current best-fit values for the parameters a[1..ma], and chi-square = chisq. The arrays
covar[1..ma][1..ma], alpha[1..ma][1..ma] are used as working space during most
iterations. Supply a routine funcs(x,a,yfit,dyda,ma) that evaluates the fitting function
yfit, and its derivatives dyda[1..ma] with respect to the fitting parameters a at x. On
the first call provide an initial guess for the parameters a, and set alamda<0 for initialization
(which then sets alamda=.001). If a step succeeds chisq becomes smaller and alamda
decreases by a factor of 10. If a step fails alamda grows by a factor of 10. You must call this
routine repeatedly until convergence is achieved. Then, make one final call with alamda=0, so
that covar[1..ma][1..ma] returns the covariance matrix, and alpha the curvature matrix.
(Parameters held fixed will return zero covariances.) */
{
    void covsrt(float **covar, int ma, int ia[], int mfit);
    void gaussj(float **a, int n, float **b, int m);
    void mrqcof(float x[], float y[], float sig[], int ndata, float a[],
        int ia[], int ma, float **alpha, float beta[], float *chisq,
        void (*funcs)(float, float [], float *, float [], int));
    int j,k,l;
    static int mfit;
    static float ochisq,*atry,*beta,*da,**oneda;

    if (*alamda < 0.0) {                /* Initialization. */
        atry=vector(1,ma);
        beta=vector(1,ma);
        da=vector(1,ma);
        for (mfit=0,j=1;j<=ma;j++)
            if (ia[j]) mfit++;
        oneda=matrix(1,mfit,1,1);
        *alamda=0.001;
        mrqcof(x,y,sig,ndata,a,ia,ma,alpha,beta,chisq,funcs);
        ochisq=(*chisq);
        for (j=1;j<=ma;j++) atry[j]=a[j];
    }
    for (j=1;j<=mfit;j++) {             /* Alter linearized fitting matrix, by augmenting
                                           diagonal elements. */
        for (k=1;k<=mfit;k++) covar[j][k]=alpha[j][k];
        covar[j][j]=alpha[j][j]*(1.0+(*alamda));
        oneda[j][1]=beta[j];
    }
    gaussj(covar,mfit,oneda,1);         /* Matrix solution. */
    for (j=1;j<=mfit;j++) da[j]=oneda[j][1];
    if (*alamda == 0.0) {               /* Once converged, evaluate covariance matrix. */
        covsrt(covar,ma,ia,mfit);
        covsrt(alpha,ma,ia,mfit);       /* Spread out alpha to its full size too. */
        free_matrix(oneda,1,mfit,1,1);
        free_vector(da,1,ma);
        free_vector(beta,1,ma);
        free_vector(atry,1,ma);
        return;
    }
    for (j=0,l=1;l<=ma;l++)             /* Did the trial succeed? */
        if (ia[l]) atry[l]=a[l]+da[++j];
    mrqcof(x,y,sig,ndata,atry,ia,ma,covar,da,chisq,funcs);
    if (*chisq < ochisq) {              /* Success, accept the new solution. */
        *alamda *= 0.1;
        ochisq=(*chisq);
        for (j=1;j<=mfit;j++) {
            for (k=1;k<=mfit;k++) alpha[j][k]=covar[j][k];
            beta[j]=da[j];
        }
        for (l=1;l<=ma;l++) a[l]=atry[l];
    } else {                            /* Failure, increase alamda and return. */
        *alamda *= 10.0;
        *chisq=ochisq;
    }
}



Notice the use of the routine covsrt from §15.4. This is merely for rearranging
the covariance matrix covar into the order of all ma parameters. The above routine
also makes use of

#include "nrutil.h"

void mrqcof(float x[], float y[], float sig[], int ndata, float a[], int ia[],
    int ma, float **alpha, float beta[], float *chisq,
    void (*funcs)(float, float [], float *, float [], int))
/* Used by mrqmin to evaluate the linearized fitting matrix alpha, and vector beta as in (15.5.8),
and calculate chi-square. */
{
    int i,j,k,l,m,mfit=0;
    float ymod,wt,sig2i,dy,*dyda;

    dyda=vector(1,ma);
    for (j=1;j<=ma;j++)
        if (ia[j]) mfit++;
    for (j=1;j<=mfit;j++) {             /* Initialize (symmetric) alpha, beta. */
        for (k=1;k<=j;k++) alpha[j][k]=0.0;
        beta[j]=0.0;
    }
    *chisq=0.0;
    for (i=1;i<=ndata;i++) {            /* Summation loop over all data. */
        (*funcs)(x[i],a,&ymod,dyda,ma);
        sig2i=1.0/(sig[i]*sig[i]);
        dy=y[i]-ymod;
        for (j=0,l=1;l<=ma;l++) {
            if (ia[l]) {
                wt=dyda[l]*sig2i;
                for (j++,k=0,m=1;m<=l;m++)
                    if (ia[m]) alpha[j][++k] += wt*dyda[m];
                beta[j] += dy*wt;
            }
        }
        *chisq += dy*dy*sig2i;          /* And find chi-square. */
    }
    for (j=2;j<=mfit;j++)               /* Fill in the symmetric side. */
        for (k=1;k<j;k++) alpha[k][j]=alpha[j][k];
    free_vector(dyda,1,ma);
}


Example 

The following function fgauss is an example of a user-supplied function
funcs. Used with the above routine mrqmin (in turn using mrqcof, covsrt, and
gaussj), it fits for the model

$$y(x) = \sum_{k=1}^{K} B_k \exp\left[-\left(\frac{x - E_k}{G_k}\right)^2\right] \tag{15.5.16}$$

which is a sum of K Gaussians, each having a variable position, amplitude, and
width. We store the parameters in the order $B_1, E_1, G_1, B_2, E_2, G_2, \ldots, B_K,
E_K, G_K$.

#include <math.h>

void fgauss(float x, float a[], float *y, float dyda[], int na)
/* y(x;a) is the sum of na/3 Gaussians (15.5.16). The amplitude, center, and width of the
   Gaussians are stored in consecutive locations of a: a[i] = B_k, a[i+1] = E_k, a[i+2] = G_k,
   k = 1,...,na/3. The dimensions of the arrays are a[1..na], dyda[1..na]. */
{
    int i;
    float fac,ex,arg;

    *y=0.0;
    for (i=1;i<=na-1;i+=3) {
        arg=(x-a[i+1])/a[i+2];
        ex=exp(-arg*arg);
        fac=a[i]*ex*2.0*arg;
        *y += a[i]*ex;
        dyda[i]=ex;
        dyda[i+1]=fac/a[i+2];
        dyda[i+2]=fac*arg/a[i+2];
    }
}
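
The three derivative assignments in fgauss follow directly from (15.5.16). As a check, here is the short calculation, written with the shorthand $u_k$ for the quantity the code calls arg (the symbol $u_k$ is introduced here only for this purpose):

```latex
u_k \equiv \frac{x - E_k}{G_k}, \qquad
y(x) = \sum_{k=1}^{K} B_k\, e^{-u_k^2}

\frac{\partial y}{\partial B_k} = e^{-u_k^2}
  \qquad \text{(dyda[i] = ex)}

\frac{\partial y}{\partial E_k}
  = B_k\, e^{-u_k^2}\,(-2u_k)\left(-\frac{1}{G_k}\right)
  = \frac{2 B_k u_k\, e^{-u_k^2}}{G_k}
  \qquad \text{(dyda[i+1] = fac/a[i+2])}

\frac{\partial y}{\partial G_k}
  = B_k\, e^{-u_k^2}\,(-2u_k)\left(-\frac{u_k}{G_k}\right)
  = \frac{2 B_k u_k^2\, e^{-u_k^2}}{G_k}
  \qquad \text{(dyda[i+2] = fac*arg/a[i+2])}
```

Here fac stands for $2 B_k u_k e^{-u_k^2}$, which is why it is computed once and reused in the last two assignments.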


More Advanced Methods for Nonlinear Least Squares 

The Levenberg-Marquardt algorithm can be implemented as a model-trust
region method for minimization (see §9.7 and ref. [2]) applied to the special case
of a least squares function. A code of this kind due to Moré [3] can be found in
MINPACK [4]. Another algorithm for nonlinear least-squares keeps the
second-derivative term we dropped in the Levenberg-Marquardt method whenever it would
be better to do so. These methods are called “full Newton-type” methods and
are reputed to be more robust than Levenberg-Marquardt, but more complex. One
implementation is the code NL2SOL [5].

15.6 Confidence Limits on Estimated Model Parameters


Several times already in this chapter we have made statements about the standard 
errors, or uncertainties, in a set of M estimated parameters a. We have given some 
formulas for computing standard deviations or variances of individual parameters 
(equations 15.2.9, 15.4.15, 15.4.19), as well as some formulas for covariances 
between pairs of parameters (equation 15.2.10; remark following equation 15.4.15; 
equation 15.4.20; equation 15.5.15). 

In this section, we want to be more explicit regarding the precise meaning 
of these quantitative uncertainties, and to give further information about how 
quantitative confidence limits on fitted parameters can be estimated. The subject 
can get somewhat technical, and even somewhat confusing, so we will try to make 
precise statements, even when they must be offered without proof. 

Figure 15.6.1 shows the conceptual scheme of an experiment that “measures”
a set of parameters. There is some underlying true set of parameters $\mathbf{a}_{\text{true}}$ that are
known to Mother Nature but hidden from the experimenter. These true parameters
are statistically realized, along with random measurement errors, as a measured data
set, which we will symbolize as $\mathcal{D}_{(0)}$. The data set $\mathcal{D}_{(0)}$ is known to the experimenter.
He or she fits the data to a model by $\chi^2$ minimization or some other technique, and
obtains measured, i.e., fitted, values for the parameters, which we here denote $\mathbf{a}_{(0)}$.

Because measurement errors have a random component, $\mathcal{D}_{(0)}$ is not a unique
realization of the true parameters $\mathbf{a}_{\text{true}}$. Rather, there are infinitely many other
realizations of the true parameters as “hypothetical data sets” each of which could
have been the one measured, but happened not to be. Let us symbolize these
by $\mathcal{D}_{(1)}, \mathcal{D}_{(2)}, \ldots$. Each one, had it been realized, would have given a slightly
different set of fitted parameters, $\mathbf{a}_{(1)}, \mathbf{a}_{(2)}, \ldots$, respectively. These parameter sets
$\mathbf{a}_{(i)}$ therefore occur with some probability distribution in the $M$-dimensional space
of all possible parameter sets $\mathbf{a}$. The actual measured set $\mathbf{a}_{(0)}$ is one member drawn
from this distribution.

Even more interesting than the probability distribution of $\mathbf{a}_{(i)}$ would be the
distribution of the difference $\mathbf{a}_{(i)} - \mathbf{a}_{\text{true}}$. This distribution differs from the former
one by a translation that puts Mother Nature’s true value at the origin. If we knew this
distribution, we would know everything that there is to know about the quantitative
uncertainties in our experimental measurement $\mathbf{a}_{(0)}$.

So the name of the game is to find some way of estimating or approximating
the probability distribution of $\mathbf{a}_{(i)} - \mathbf{a}_{\text{true}}$ without knowing $\mathbf{a}_{\text{true}}$ and without having
available to us an infinite universe of hypothetical data sets.

Monte Carlo Simulation of Synthetic Data Sets 

Although the measured parameter set $\mathbf{a}_{(0)}$ is not the true one, let us consider
a fictitious world in which it was the true one. Since we hope that our measured
parameters are not too wrong, we hope that that fictitious world is not too different
from the actual world with parameters $\mathbf{a}_{\text{true}}$. In particular, let us hope (no, let us
assume) that the shape of the probability distribution $\mathbf{a}_{(i)} - \mathbf{a}_{(0)}$ in the fictitious
world is the same, or very nearly the same, as the shape of the probability distribution
$\mathbf{a}_{(i)} - \mathbf{a}_{\text{true}}$ in the real world. Notice that we are not assuming that $\mathbf{a}_{(0)}$ and $\mathbf{a}_{\text{true}}$ are
equal; they are certainly not. We are only assuming that the way in which random
errors enter the experiment and data analysis does not vary rapidly as a function of
$\mathbf{a}_{\text{true}}$, so that $\mathbf{a}_{(0)}$ can serve as a reasonable surrogate.

Figure 15.6.1. A statistical universe of data sets from an underlying model. True parameters $\mathbf{a}_{\text{true}}$ are
realized in a data set, from which fitted (observed) parameters $\mathbf{a}_{(0)}$ are obtained. If the experiment were
repeated many times, new data sets and new values of the fitted parameters would be obtained.

Now, often, the distribution of $\mathbf{a}_{(i)} - \mathbf{a}_{(0)}$ in the fictitious world is within our
power to calculate (see Figure 15.6.2). If we know something about the process
that generated our data, given an assumed set of parameters $\mathbf{a}_{(0)}$, then we can
usually figure out how to simulate our own sets of “synthetic” realizations of these
parameters as “synthetic data sets.” The procedure is to draw random numbers from
appropriate distributions (cf. §7.2–§7.3) so as to mimic our best understanding of
the underlying process and measurement errors in our apparatus. With such random
draws, we construct data sets with exactly the same numbers of measured points, and
precisely the same values of all control (independent) variables, as our actual data set
$\mathcal{D}_{(0)}$. Let us call these simulated data sets $\mathcal{D}^S_{(1)}, \mathcal{D}^S_{(2)}, \ldots$. By construction these are
supposed to have exactly the same statistical relationship to $\mathbf{a}_{(0)}$ as the $\mathcal{D}_{(i)}$’s have
to $\mathbf{a}_{\text{true}}$. (For the case where you don’t know enough about what you are measuring
to do a credible job of simulating it, see below.)

Next, for each $\mathcal{D}^S_{(j)}$, perform exactly the same procedure for estimation of
parameters, e.g., $\chi^2$ minimization, as was performed on the actual data to get
the parameters $\mathbf{a}_{(0)}$, giving simulated measured parameters $\mathbf{a}^S_{(1)}, \mathbf{a}^S_{(2)}, \ldots$. Each
simulated measured parameter set yields a point $\mathbf{a}^S_{(i)} - \mathbf{a}_{(0)}$. Simulate enough data
sets and enough derived simulated measured parameters, and you map out the desired
probability distribution in M dimensions.

In fact, the ability to do Monte Carlo simulations in this fashion has
revolutionized many fields of modern experimental science. Not only is one able to
characterize the errors of parameter estimation in a very precise way; one can also
try out on the computer different methods of parameter estimation, or different data
reduction techniques, and seek to minimize the uncertainty of the result according
to any desired criteria. Offered the choice between mastery of a five-foot shelf of
analytical statistics books and middling ability at performing statistical Monte Carlo
simulations, we would surely choose to have the latter skill.

Figure 15.6.2. Monte Carlo simulation of an experiment. The fitted parameters from an actual experiment
are used as surrogates for the true parameters. Computer-generated random numbers are used to simulate
many synthetic data sets. Each of these is analyzed to obtain its fitted parameters. The distribution of
these fitted parameters around the (known) surrogate true parameters is thus studied.

Quick-and-Dirty Monte Carlo: The Bootstrap Method 

Here is a powerful technique that can often be used when you don’t know 
enough about the underlying process, or the nature of your measurement errors, 
to do a credible Monte Carlo simulation. Suppose that your data set consists of 
N independent and identically distributed (or iid) “data points.” Each data point 
probably consists of several numbers, e.g., one or more control variables (uniformly 
distributed, say, in the range that you have decided to measure) and one or more 
associated measured values (each distributed however Mother Nature chooses). “Iid”
means that the sequential order of the data points is not of consequence to the process
that you are using to get the fitted parameters $\mathbf{a}$. For example, a $\chi^2$ sum like
(15.5.5) does not care in what order the points are added. Even simpler examples 
are the mean value of a measured quantity, or the mean of some function of the 
measured quantities. 

The bootstrap method [1] uses the actual data set with its N data points, to
generate any number of synthetic data sets $\mathcal{D}^S_{(1)}, \mathcal{D}^S_{(2)}, \ldots$, also with N data points.
The procedure is simply to draw N data points at a time with replacement from the
set $\mathcal{D}_{(0)}$. Because of the replacement, you do not simply get back your original
data set each time. You get sets in which a random fraction of the original points,
typically $\sim 1/e \approx 37\%$, are replaced by duplicated original points. Now, exactly
as in the previous discussion, you subject these data sets to the same estimation
procedure as was performed on the actual data, giving a set of simulated measured
parameters $\mathbf{a}^S_{(1)}, \mathbf{a}^S_{(2)}, \ldots$. These will be distributed around $\mathbf{a}_{(0)}$ in close to the same
way that $\mathbf{a}_{(0)}$ is distributed around $\mathbf{a}_{\text{true}}$.

Sounds like getting something for nothing, doesn’t it? In fact, it has taken more 
than a decade for the bootstrap method to become accepted by statisticians. By now, 
however, enough theorems have been proved to render the bootstrap reputable (see [2] 
for references). The basic idea behind the bootstrap is that the actual data set, viewed 
as a probability distribution consisting of delta functions at the measured values, is 
in most cases the best — or only — available estimator of the underlying probability 
distribution. It takes courage, but one can often simply use that distribution as the 
basis for Monte Carlo simulations. 

Watch out for cases where the bootstrap’s “iid” assumption is violated. For 
example, if you have made measurements at evenly spaced intervals of some control 
variable, then you can usually get away with pretending that these are “iid,” uniformly 
distributed over the measured range. However, some estimators of a (e.g., ones 
involving Fourier methods) might be particularly sensitive to all the points on a grid 
being present. In that case, the bootstrap is going to give a wrong distribution. Also 
watch out for estimators that look at anything like small-scale clumpiness within the 
N data points, or estimators that sort the data and look at sequential differences. 
Obviously the bootstrap will fail on these, too. (The theorems justifying the method 
are still true, but some of their technical assumptions are violated by these examples.) 

For a large class of problems, however, the bootstrap does yield easy, very 
quick, Monte Carlo estimates of the errors in an estimated parameter set. 

Confidence Limits 



Rather than present all details of the probability distribution of errors in 
parameter estimation, it is common practice to summarize the distribution in the 
form of confidence limits. The full probability distribution is a function defined 
on the M-dimensional space of parameters $\mathbf{a}$. A confidence region (or confidence
interval) is just a region of that M-dimensional space (hopefully a small region) that
contains a certain (hopefully large) percentage of the total probability distribution. 
You point to a confidence region and say, e.g., “there is a 99 percent chance that the 
true parameter values fall within this region around the measured value.” 

It is worth emphasizing that you, the experimenter, get to pick both the 
confidence level (99 percent in the above example), and the shape of the confidence 
region. The only requirement is that your region does include the stated percentage 
of probability. Certain percentages are, however, customary in scientific usage: 68.3 
percent (the lowest confidence worthy of quoting), 90 percent, 95.4 percent, 99 
percent, and 99.73 percent. Higher confidence levels are conventionally “ninety-nine 
point nine ... nine.” As for shape, obviously you want a region that is compact 
and reasonably centered on your measurement $\mathbf{a}_{(0)}$, since the whole purpose of a
confidence limit is to inspire confidence in that measured value. In one dimension,
the convention is to use a line segment centered on the measured value; in higher
dimensions, ellipses or ellipsoids are most frequently used.

Figure 15.6.3. Confidence intervals in 1 and 2 dimensions. The same fraction of measured points (here
68%) lies (i) between the two vertical lines, (ii) between the two horizontal lines, (iii) within the ellipse.

You might suspect, correctly, that the numbers 68.3 percent, 95.4 percent, 
and 99.73 percent, and the use of ellipsoids, have some connection with a normal 
distribution. That is true historically, but not always relevant nowadays. In general, 
the probability distribution of the parameters will not be normal, and the above 
numbers, used as levels of confidence, are purely matters of convention. 

Figure 15.6.3 sketches a possible probability distribution for the case M = 2.
Shown are three different confidence regions which might usefully be given, all at
the same confidence level. The two vertical lines enclose a band (horizontal interval)
which represents the 68 percent confidence interval for the variable $a_1$ without regard
to the value of $a_2$. Similarly the horizontal lines enclose a 68 percent confidence
interval for $a_2$. The ellipse shows a 68 percent confidence interval for $a_1$ and $a_2$
jointly. Notice that to enclose the same probability as the two bands, the ellipse must
necessarily extend outside of both of them (a point we will return to below).

Constant Chi-Square Boundaries as Confidence Limits 

When the method used to estimate the parameters $\mathbf{a}_{(0)}$ is chi-square minimization,
as in the previous sections of this chapter, then there is a natural choice for the
shape of confidence intervals, whose use is almost universal. For the observed data
set $\mathcal{D}_{(0)}$, the value of $\chi^2$ is a minimum at $\mathbf{a}_{(0)}$. Call this minimum value $\chi^2_{\min}$. If
the vector $\mathbf{a}$ of parameter values is perturbed away from $\mathbf{a}_{(0)}$, then $\chi^2$ increases. The
region within which $\chi^2$ increases by no more than a set amount $\Delta\chi^2$ defines some
M-dimensional confidence region around $\mathbf{a}_{(0)}$. If $\Delta\chi^2$ is set to be a large number,
this will be a big region; if it is small, it will be small. Somewhere in between there
will be choices of $\Delta\chi^2$ that cause the region to contain, variously, 68 percent, 90
percent, etc. of the probability distribution for $\mathbf{a}$’s, as defined above. These regions are
taken as the confidence regions for the parameters $\mathbf{a}_{(0)}$.

Figure 15.6.4. Confidence region ellipses corresponding to values of chi-square larger than the fitted
minimum. The solid curves, with $\Delta\chi^2$ = 1.00, 2.71, 6.63 project onto one-dimensional intervals AA',
BB', CC'. These intervals, not the ellipses themselves, contain 68.3%, 90%, and 99% of normally
distributed data. The ellipse that contains 68.3% of normally distributed data is shown dashed, and has
$\Delta\chi^2$ = 2.30. For additional numerical values, see accompanying table.

Very frequently one is interested not in the full M-dimensional confidence
region, but in individual confidence regions for some smaller number $\nu$ of parameters.
For example, one might be interested in the confidence interval of each parameter
taken separately (the bands in Figure 15.6.3), in which case $\nu = 1$. In that case,
the natural confidence regions in the $\nu$-dimensional subspace of the M-dimensional
parameter space are the projections of the M-dimensional regions defined by fixed
$\Delta\chi^2$ into the $\nu$-dimensional spaces of interest. In Figure 15.6.4, for the case M = 2,
we show regions corresponding to several values of $\Delta\chi^2$. The one-dimensional
confidence interval in $a_2$ corresponding to the region bounded by $\Delta\chi^2 = 1$ lies
between the lines A and A'.

Notice that the projection of the higher-dimensional region on the
lower-dimensional space is used, not the intersection. The intersection would be the band
between Z and Z'. It is never used. It is shown in the figure only for the purpose of
making this cautionary point, that it should not be confused with the projection.
Probability Distribution of Parameters in the Normal Case

You may be wondering why we have, in this section up to now, made no
connection at all with the error estimates that come out of the $\chi^2$ fitting procedure,
most notably the covariance matrix $C_{ij}$. The reason is this: $\chi^2$ minimization
is a useful means for estimating parameters even if the measurement errors are
not normally distributed. While normally distributed errors are required if the $\chi^2$
parameter estimate is to be a maximum likelihood estimator (§15.1), one is often
willing to give up that property in return for the relative convenience of the $\chi^2$
procedure. Only in extreme cases, measurement error distributions with very large
“tails,” is $\chi^2$ minimization abandoned in favor of more robust techniques, as will
be discussed in §15.7.

However, the formal covariance matrix that comes out of a $\chi^2$ minimization has
a clear quantitative interpretation only if (or to the extent that) the measurement errors
actually are normally distributed. In the case of nonnormal errors, you are “allowed”

• to fit for parameters by minimizing $\chi^2$

• to use a contour of constant $\Delta\chi^2$ as the boundary of your confidence region

• to use Monte Carlo simulation or detailed analytic calculation in determining
which contour $\Delta\chi^2$ is the correct one for your desired confidence
level

• to give the covariance matrix $C_{ij}$ as the “formal covariance matrix of
the fit.”

You are not allowed

• to use formulas that we now give for the case of normal errors, which
establish quantitative relationships among $\Delta\chi^2$, $C_{ij}$, and the confidence
level.

Here are the key theorems that hold when (i) the measurement errors are
normally distributed, and either (ii) the model is linear in its parameters or (iii) the
sample size is large enough that the uncertainties in the fitted parameters $\mathbf{a}$ do not
extend outside a region in which the model could be replaced by a suitable linearized
model. [Note that condition (iii) does not preclude your use of a nonlinear routine
like mrqmin to find the fitted parameters.]

Theorem A. $\chi^2_{\min}$ is distributed as a chi-square distribution with $N - M$
degrees of freedom, where N is the number of data points and M is the number of
fitted parameters. This is the basic theorem that lets you evaluate the goodness-of-fit
of the model, as discussed above in §15.1. We list it first to remind you that unless
the goodness-of-fit is credible, the whole estimation of parameters is suspect.

Theorem B. If $\mathbf{a}^S_{(j)}$ is drawn from the universe of simulated data sets with
actual parameters $\mathbf{a}_{(0)}$, then the probability distribution of $\delta\mathbf{a} \equiv \mathbf{a}^S_{(j)} - \mathbf{a}_{(0)}$ is the
multivariate normal distribution

$$P(\delta\mathbf{a})\, da_1 \ldots da_M = \text{const} \times \exp\left(-\tfrac{1}{2}\, \delta\mathbf{a} \cdot [\alpha] \cdot \delta\mathbf{a}\right) da_1 \ldots da_M$$

where $[\alpha]$ is the curvature matrix defined in equation (15.5.8).

Theorem C. If $\mathbf{a}^S_{(j)}$ is drawn from the universe of simulated data sets with
actual parameters $\mathbf{a}_{(0)}$, then the quantity $\Delta\chi^2 \equiv \chi^2(\mathbf{a}_{(j)}) - \chi^2(\mathbf{a}_{(0)})$ is distributed as
a chi-square distribution with M degrees of freedom. Here the $\chi^2$’s are all evaluated
using the fixed (actual) data set $\mathcal{D}_{(0)}$. This theorem makes the connection between
particular values of $\Delta\chi^2$ and the fraction of the probability distribution that they
enclose as an M-dimensional region, i.e., the confidence level of the M-dimensional
confidence region.

Theorem D. Suppose that $\mathbf{a}^S_{(j)}$ is drawn from the universe of simulated data
sets (as above), that its first $\nu$ components $a_1, \ldots, a_\nu$ are held fixed, and that its
remaining $M - \nu$ components are varied so as to minimize $\chi^2$. Call this minimum
value $\chi^2_\nu$. Then $\Delta\chi^2_\nu \equiv \chi^2_\nu - \chi^2_{\min}$ is distributed as a chi-square distribution with
$\nu$ degrees of freedom. If you consult Figure 15.6.4, you will see that this theorem
connects the projected $\Delta\chi^2$ region with a confidence level. In the figure, a point that
is held fixed in $a_2$ and allowed to vary in $a_1$ minimizing $\chi^2$ will seek out the ellipse
whose top or bottom edge is tangent to the line of constant $a_2$, and is therefore the
line that projects it onto the smaller-dimensional space.

As a first example, let us consider the case $\nu = 1$, where we want to find
the confidence interval of a single parameter, say $a_1$. Notice that the chi-square
distribution with $\nu = 1$ degree of freedom is the same distribution as that of the square
of a single normally distributed quantity. Thus $\Delta\chi^2_\nu < 1$ occurs 68.3 percent of the
time (1-$\sigma$ for the normal distribution), $\Delta\chi^2_\nu < 4$ occurs 95.4 percent of the time (2-$\sigma$
for the normal distribution), $\Delta\chi^2_\nu < 9$ occurs 99.73 percent of the time (3-$\sigma$ for the
normal distribution), etc. In this manner you find the $\Delta\chi^2_\nu$ that corresponds to your
desired confidence level. (Additional values are given in the accompanying table.)

Let $\delta\mathbf{a}$ be a change in the parameters whose first component is arbitrary, $\delta a_1$,
but the rest of whose components are chosen to minimize the $\Delta\chi^2$. Then Theorem
D applies. The value of $\Delta\chi^2$ is given in general by

$$\Delta\chi^2 = \delta\mathbf{a} \cdot [\alpha] \cdot \delta\mathbf{a} \tag{15.6.1}$$

which follows from equation (15.5.8) applied at $\chi^2_{\min}$ where $\beta_k = 0$. Since $\delta\mathbf{a}$ by
hypothesis minimizes $\chi^2$ in all but its first component, the second through $M$th
components of the normal equations (15.5.9) continue to hold. Therefore, the
solution of (15.5.9) is

$$\delta\mathbf{a} = [\alpha]^{-1} \cdot \begin{pmatrix} c \\ 0 \\ \vdots \\ 0 \end{pmatrix} = [C] \cdot \begin{pmatrix} c \\ 0 \\ \vdots \\ 0 \end{pmatrix} \tag{15.6.2}$$

where c is one arbitrary constant that we get to adjust to make (15.6.1) give the
desired left-hand value. Plugging (15.6.2) into (15.6.1) and using the fact that $[C]$
and $[\alpha]$ are inverse matrices of one another, we get

$$c = \delta a_1 / C_{11} \quad \text{and} \quad \Delta\chi^2_\nu = (\delta a_1)^2 / C_{11} \tag{15.6.3}$$

or

$$\delta a_1 = \pm\sqrt{\Delta\chi^2_\nu}\, \sqrt{C_{11}} \tag{15.6.4}$$

At last! A relation between the confidence interval $\pm\delta a_1$ and the formal
standard error $\sigma_1 \equiv \sqrt{C_{11}}$. Not unreasonably, we find that the 68 percent confidence
interval is $\pm\sigma_1$, the 95 percent confidence interval is $\pm 2\sigma_1$, etc.

$\Delta\chi^2$ as a Function of Confidence Level and Degrees of Freedom

                                ν
    p          1       2       3       4       5       6
  68.3%      1.00    2.30    3.53    4.72    5.89    7.04
  90%        2.71    4.61    6.25    7.78    9.24    10.6
  95.4%      4.00    6.17    8.02    9.70    11.3    12.8
  99%        6.63    9.21    11.3    13.3    15.1    16.8
  99.73%     9.00    11.8    14.2    16.3    18.2    20.1
  99.99%     15.1    18.4    21.1    23.5    25.7    27.8


These considerations hold not just for the individual parameters $a_i$, but also
for any linear combination of them: If

$$b \equiv \sum_{k=1}^{M} c_k a_k = \mathbf{c} \cdot \mathbf{a} \tag{15.6.5}$$

then the 68 percent confidence interval on b is

$$\delta b = \pm\sqrt{\mathbf{c} \cdot [C] \cdot \mathbf{c}} \tag{15.6.6}$$

However, these simple, normal-sounding numerical relationships do not hold in
the case $\nu > 1$ [3]. In particular, $\Delta\chi^2 = 1$ is not the boundary, nor does it project
onto the boundary, of a 68.3 percent confidence region when $\nu > 1$. If you want
to calculate not confidence intervals in one parameter, but confidence ellipses in
two parameters jointly, or ellipsoids in three, or higher, then you must use the
following prescription for implementing Theorems C and D above:

• Let $\nu$ be the number of fitted parameters whose joint confidence region you
wish to display, $\nu \le M$. Call these parameters the “parameters of interest.”

• Let p be the confidence limit desired, e.g., p = 0.68 or p = 0.95.

• Find $\Delta$ (i.e., $\Delta\chi^2$) such that the probability of a chi-square variable with
$\nu$ degrees of freedom being less than $\Delta$ is p. For some useful values of p
and $\nu$, $\Delta$ is given in the table. For other values, you can use the routine
gammq and a simple root-finding routine (e.g., bisection) to find $\Delta$ such
that gammq($\nu/2$, $\Delta/2$) = $1 - p$.

• Take the $M \times M$ covariance matrix $[C] = [\alpha]^{-1}$ of the chi-square fit.
Copy the intersection of the $\nu$ rows and columns corresponding to the
parameters of interest into a $\nu \times \nu$ matrix denoted $[C_{\text{proj}}]$.

• Invert the matrix $[C_{\text{proj}}]$. (In the one-dimensional case this was just taking
the reciprocal of the element $C_{11}$.)

• The equation for the elliptical boundary of your desired confidence region
in the $\nu$-dimensional subspace of interest is

$$\Delta = \delta\mathbf{a}' \cdot [C_{\text{proj}}]^{-1} \cdot \delta\mathbf{a}' \tag{15.6.7}$$

where $\delta\mathbf{a}'$ is the $\nu$-dimensional vector of parameters of interest.

Figure 15.6.5. Relation of the confidence region ellipse $\Delta\chi^2 = 1$ to quantities computed by singular
value decomposition. The vectors $\mathbf{V}_{(i)}$ are unit vectors along the principal axes of the confidence region.
The semi-axes have lengths equal to the reciprocal of the singular values $w_i$. If the axes are all scaled
by some constant factor $\alpha$, $\Delta\chi^2$ is scaled by the factor $\alpha^2$.

If you are confused at this point, you may find it helpful to compare Figure
15.6.4 and the accompanying table, considering the case M = 2 with $\nu = 1$ and
$\nu = 2$. You should be able to verify the following statements: (i) The horizontal
band between C and C' contains 99 percent of the probability distribution, so it
is a confidence limit on $a_2$ alone at this level of confidence. (ii) Ditto the band
between B and B' at the 90 percent confidence level. (iii) The dashed ellipse,
labeled by $\Delta\chi^2 = 2.30$, contains 68.3 percent of the probability distribution, so it is
a confidence region for $a_1$ and $a_2$ jointly, at this level of confidence.

Confidence Limits from Singular Value Decomposition 

When you have obtained your $\chi^2$ fit by singular value decomposition (§15.4), the
information about the fit’s formal errors comes packaged in a somewhat different, but
generally more convenient, form. The columns of the matrix $\mathbf{V}$ are an orthonormal
set of M vectors that are the principal axes of the $\Delta\chi^2$ = constant ellipsoids.
We denote the columns as $\mathbf{V}_{(1)} \ldots \mathbf{V}_{(M)}$. The lengths of those axes are inversely
proportional to the corresponding singular values $w_1 \ldots w_M$; see Figure 15.6.5. The
boundaries of the ellipsoids are thus given by

$$\Delta\chi^2 = w_1^2 (\mathbf{V}_{(1)} \cdot \delta\mathbf{a})^2 + \cdots + w_M^2 (\mathbf{V}_{(M)} \cdot \delta\mathbf{a})^2 \tag{15.6.8}$$

which is the justification for writing equation (15.4.18) above. Keep in mind that
it is much easier to plot an ellipsoid given a list of its vector principal axes, than
given its matrix quadratic form!

The formula for the covariance matrix $[C]$ in terms of the columns $\mathbf{V}_{(i)}$ is

$$[C] = \sum_{i=1}^{M} \frac{1}{w_i^2}\, \mathbf{V}_{(i)} \otimes \mathbf{V}_{(i)} \tag{15.6.9}$$

or, in components,

$$C_{jk} = \sum_{i=1}^{M} \frac{1}{w_i^2}\, V_{ji} V_{ki} \tag{15.6.10}$$
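
In components the covariance formula is a triple loop. The sketch below uses zero-based, row-major storage and a hypothetical name, cov_from_svd. Skipping singular values that are exactly zero is an assumption on our part, intended to mirror the usual practice of discarding the directions that were edited out of the fit.

```c
/* Covariance matrix from the SVD of a chi-square fit.
   v[j*m + i] holds V_ji (column i of V is the i-th principal axis),
   w[i] holds the singular values, and c[j*m + k] receives C_jk. */
void cov_from_svd(int m, const double v[], const double w[], double c[])
{
    int i, j, k;

    for (j = 0; j < m; j++) {
        for (k = 0; k < m; k++) {
            double sum = 0.0;
            for (i = 0; i < m; i++)
                if (w[i] != 0.0)             /* assumed: skip zeroed directions */
                    sum += v[j * m + i] * v[k * m + i] / (w[i] * w[i]);
            c[j * m + k] = sum;
        }
    }
}
```

For M = 2 with principal axes rotated by 45 degrees and singular values 1 and 2, the result has diagonal elements 0.625 and off-diagonal elements 0.375. The off-diagonal term is what tilts the confidence ellipse away from the coordinate axes.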


CITED REFERENCES AND FURTHER READING: 

Efron, B. 1982, The Jackknife, the Bootstrap, and Other Resampling Plans (Philadelphia: S.I.A.M.). [1]

Efron, B., and Tibshirani, R. 1986, Statistical Science vol. 1, pp. 54-77. [2]

Avni, Y. 1976, Astrophysical Journal, vol. 210, pp. 642-646. [3]

Lampton, M., Margon, M., and Bowyer, S. 1976, Astrophysical Journal, vol. 208, pp. 177-190.
Brownlee, K.A. 1965, Statistical Theory and Methodology, 2nd ed. (New York: Wiley). 

Martin, B.R. 1971, Statistics for Physicists (New York: Academic Press). 


15.7 Robust Estimation 

The concept of robustness has been mentioned in passing several times already. 
In §14.1 we noted that the median was a more robust estimator of central value than 
the mean; in §14.6 it was mentioned that rank correlation is more robust than linear 
correlation. The concept of outlier points as exceptions to a Gaussian model for 
experimental error was discussed in §15.1. 

The term “robust” was coined in statistics by G.E.P. Box in 1953. Various 
definitions of greater or lesser mathematical rigor are possible for the term, but in 
general, referring to a statistical estimator, it means “insensitive to small departures 
from the idealized assumptions for which the estimator is optimized.” [1,2] The word 
“small” can have two different interpretations, both important: either fractionally 
small departures for all data points, or else fractionally large departures for a small 
number of data points. It is the latter interpretation, leading to the notion of outlier 
points, that is generally the most stressful for statistical procedures. 

Statisticians have developed various sorts of robust statistical estimators. Many, 
if not most, can be grouped in one of three categories. 

M-estimates follow from maximum-likelihood arguments very much as equations
(15.1.5) and (15.1.7) followed from equation (15.1.3). M-estimates are usually
the most relevant class for model-fitting, that is, estimation of parameters. We 
therefore consider these estimates in some detail below. 

L-estimates are “linear combinations of order statistics.” These are most 
applicable to estimations of central value and central tendency, though they can 
occasionally be applied to some problems in estimation of parameters. Two 
“typical” L-estimates will give you the general idea. They are (i) the median, and 
(ii) Tukey’s trimean, defined as the weighted average of the first, second, and third 
quartile points in a distribution, with weights 1/4, 1/2, and 1/4, respectively. 

R-estimates are estimates based on rank tests. For example, the equality or
inequality of two distributions can be estimated by the Wilcoxon test of computing
the mean rank of one distribution in a combined sample of both distributions.
The Kolmogorov-Smirnov statistic (equation 14.3.6) and the Spearman rank-order
correlation coefficient (14.6.1) are R-estimates in essence, if not always by formal
definition.

Figure 15.7.1. Examples where robust statistical methods are desirable: (a) A one-dimensional
distribution with a tail of outliers; statistical fluctuations in these outliers can prevent accurate determination
of the position of the central peak. (b) A distribution in two dimensions fitted to a straight line; non-robust
techniques such as least-squares fitting can have undesired sensitivity to outlying points.

Some other kinds of robust techniques, coming from the fields of optimal control 
and filtering rather than from the field of mathematical statistics, are mentioned at 
the end of this section. Some examples where robust statistical methods are desirable 
are shown in Figure 15.7.1. 

Estimation of Parameters by Local M-Estimates 

Suppose we know that our measurement errors are not normally distributed. 
Then, in deriving a maximum-likelihood formula for the estimated parameters a in a 
model y(x; a), we would write instead of equation (15.1.3) 

$$P = \prod_{i=1}^{N} \left\{ \exp\left[ -\rho(y_i, y(x_i; \mathbf{a})) \right] \Delta y \right\} \tag{15.7.1}$$

where the function $\rho$ is the negative logarithm of the probability density. Taking the
logarithm of (15.7.1) analogously with (15.1.4), we find that we want to minimize
the expression

$$\sum_{i=1}^{N} \rho(y_i, y(x_i; \mathbf{a})) \tag{15.7.2}$$

Very often, it is the case that the function $\rho$ depends not independently on its
two arguments, measured $y_i$ and predicted $y(x_i)$, but only on their difference, at least
if scaled by some weight factors $\sigma_i$ which we are able to assign to each point. In this
case the M-estimate is said to be local, and we can replace (15.7.2) by the prescription

$$\text{minimize over } \mathbf{a} \qquad \sum_{i=1}^{N} \rho\left( \frac{y_i - y(x_i; \mathbf{a})}{\sigma_i} \right) \tag{15.7.3}$$

where the function $\rho(z)$ is a function of a single variable $z \equiv [y_i - y(x_i)]/\sigma_i$.
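If we now define the derivative of $\rho(z)$ to be a function $\psi(z)$,

$$\psi(z) \equiv \frac{d\rho(z)}{dz} \tag{15.7.4}$$

then the generalization of (15.1.7) to the case of a general M-estimate is

$$0 = \sum_{i=1}^{N} \frac{1}{\sigma_i} \, \psi\left( \frac{y_i - y(x_i)}{\sigma_i} \right) \left( \frac{\partial y(x_i; \mathbf{a})}{\partial a_k} \right) \qquad k = 1, \ldots, M \tag{15.7.5}$$
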


If you compare (15.7.3) to (15.1.3), and (15.7.5) to (15.1.7), you see at once 
that the specialization for normally distributed errors is 

$$\rho(z) = \tfrac{1}{2} z^2 \qquad \psi(z) = z \qquad \text{(normal)} \tag{15.7.6}$$

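If the errors are distributed as a double or two-sided exponential, namely

$$\text{Prob}\,\{y_i - y(x_i)\} \sim \exp\left( -\left| \frac{y_i - y(x_i)}{\sigma_i} \right| \right) \tag{15.7.7}$$

then, by contrast,

$$\rho(z) = |z| \qquad \psi(z) = \text{sgn}(z) \qquad \text{(double exponential)} \tag{15.7.8}$$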


Comparing to equation (15.7.3), we see that in this case the maximum likelihood 
estimator is obtained by minimizing the mean absolute deviation, rather than the 
mean square deviation. Here the tails of the distribution, although exponentially 
decreasing, are asymptotically much larger than any corresponding Gaussian. 

A distribution with even more extensive — therefore sometimes even more 
realistic — tails is the Cauchy or Lorentzian distribution, 



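$$\text{Prob}\,\{y_i - y(x_i)\} \sim \frac{1}{1 + \frac{1}{2} \left( \frac{y_i - y(x_i)}{\sigma_i} \right)^2} \tag{15.7.9}$$

This implies

$$\rho(z) = \log\left( 1 + \tfrac{1}{2} z^2 \right) \qquad \psi(z) = \frac{z}{1 + \frac{1}{2} z^2} \qquad \text{(Lorentzian)} \tag{15.7.10}$$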

Notice that the $\psi$ function occurs as a weighting function in the generalized
normal equations (15.7.5). For normally distributed errors, equation (15.7.6) says
that the more deviant the points, the greater the weight. By contrast, when tails are
somewhat more prominent, as in (15.7.7), then (15.7.8) says that all deviant points
get the same relative weight, with only the sign information used. Finally, when
the tails are even larger, (15.7.10) says that $\psi$ increases with deviation, then starts
decreasing, so that very deviant points (the true outliers) are not counted at all
in the estimation of the parameters.
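
The three weighting functions just compared are simple enough to write down directly. The following is a sketch only; the function names are hypothetical and double precision is an arbitrary choice.

```c
/* psi(z) for normally distributed errors, eq. (15.7.6):
   the weight grows without bound as the deviation grows. */
double psi_normal(double z)
{
    return z;
}

/* psi(z) for double-exponential errors, eq. (15.7.8):
   only the sign of the deviation is used; sgn(0) is taken as 0. */
double psi_double_exp(double z)
{
    return (double)((z > 0.0) - (z < 0.0));
}

/* psi(z) for Lorentzian errors, eq. (15.7.10):
   rises to a peak at |z| = sqrt(2), then falls off like 2/z. */
double psi_lorentzian(double z)
{
    return z / (1.0 + 0.5 * z * z);
}
```

Evaluating psi_lorentzian at z = 1 and z = 100 gives roughly 0.67 and 0.02, which shows how a gross outlier ends up with far less influence than a moderately deviant point.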

This general idea, that the weight given individual points should first increase 
with deviation, then decrease, motivates some additional prescriptions for $\psi$ which
do not especially correspond to standard, textbook probability distributions. Two 
examples are 

Andrew’s sine

$$\psi(z) = \begin{cases} \sin(z/c) & |z| < c\pi \\ 0 & |z| > c\pi \end{cases} \tag{15.7.11}$$

If the measurement errors happen to be normal after all, with standard deviations $\sigma_i$,
then it can be shown that the optimal value for the constant $c$ is $c = 2.1$.

Tukey’s biweight

$$\psi(z) = \begin{cases} z (1 - z^2/c^2)^2 & |z| < c \\ 0 & |z| > c \end{cases} \tag{15.7.12}$$

where the optimal value of $c$ for normal errors is $c = 6.0$.


Numerical Calculation of M-Estimates 


To fit a model by means of an M-estimate, you first decide which M-estimate 
you want, that is, which matching pair $\rho$, $\psi$ you want to use. We rather like
(15.7.8) or (15.7.10). 

You then have to make an unpleasant choice between two fairly difficult 
problems. Either find the solution of the nonlinear set of M equations (15.7.5), or 
else minimize the single function in M variables (15.7.3). 

Notice that the function (15.7.8) has a discontinuous $\psi$, and a discontinuous
derivative for $\rho$. Such discontinuities frequently wreak havoc on both general
nonlinear equation solvers and general function minimizing routines. You might
now think of rejecting (15.7.8) in favor of (15.7.10), which is smoother. However,
you will find that the latter choice is also bad news for many general equation solving
or minimization routines: small changes in the fitted parameters can drive $\psi(z)$
off its peak into one or the other of its asymptotically small regimes. Therefore,
different terms in the equation spring into or out of action (almost as bad as analytic 
discontinuities). 

Don’t despair. If your computer budget (or, for personal computers, patience) 
is up to it, this is an excellent application for the downhill simplex minimization 
algorithm exemplified in amoeba (§10.4) or amebsa (§10.9). Those algorithms make
no assumptions about continuity; they just ooze downhill and will work for virtually
any sane choice of the function $\rho$.

It is very much to your (financial) advantage to find good starting values, 
however. Often this is done by first fitting the model by the standard $\chi^2$ (nonrobust)
techniques, e.g., as described in §15.4 or §15.5. The fitted parameters thus obtained
are then used as starting values in amoeba, now using the robust choice of $\rho$ and
minimizing the expression (15.7.3).

Fitting a Line by Minimizing Absolute Deviation 

Occasionally there is a special case that happens to be much easier than is 
suggested by the general strategy outlined above. The case of equations (15.7.7)-(15.7.8),
when the model is a simple straight line

$$y(x; a, b) = a + bx \tag{15.7.13}$$

and where the weights $\sigma_i$ are all equal, happens to be such a case. The problem is
precisely the robust version of the problem posed in equation (15.2.1) above, namely
fit a straight line through a set of data points. The merit function to be minimized is

$$\sum_{i=1}^{N} |y_i - a - b x_i| \tag{15.7.14}$$

rather than the $\chi^2$ given by equation (15.2.2).

The key simplification is based on the following fact: The median $c_M$ of a set
of numbers $c_i$ is also that value which minimizes the sum of the absolute deviations

$$\sum_i |c_i - c_M|$$

(Proof: Differentiate the above expression with respect to $c_M$ and set it to zero.)
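
The one-line proof can be spelled out a little further. The following is a sketch, assuming the numbers are distinct and ignoring the isolated points where the derivative is undefined:

```latex
\frac{d}{dc_M} \sum_{i=1}^{N} |c_i - c_M|
  = -\sum_{i=1}^{N} \operatorname{sgn}(c_i - c_M)
  = \#\{\, i : c_i < c_M \,\} - \#\{\, i : c_i > c_M \,\}
```

The derivative vanishes (or changes sign) exactly where as many of the numbers lie below $c_M$ as above it, which is the defining property of the median.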

It follows that, for fixed $b$, the value of $a$ that minimizes (15.7.14) is

$$a = \text{median}\,\{y_i - b x_i\} \tag{15.7.15}$$

Equation (15.7.5) for the parameter $b$ is

$$0 = \sum_{i=1}^{N} x_i \, \text{sgn}(y_i - a - b x_i) \tag{15.7.16}$$

(where sgn(0) is to be interpreted as zero). If we replace $a$ in this equation by the
implied function $a(b)$ of (15.7.15), then we are left with an equation in a single
variable which can be solved by bracketing and bisection, as described in §9.1.
(In fact, it is dangerous to use any fancier method of root-finding, because of the
discontinuities in equation 15.7.16.)

Here is a routine that does all this. It calls select (§8.5) to find the median.
The bracketing and bisection are built into the routine, as is the $\chi^2$ solution that
generates the initial guesses for $a$ and $b$. Notice that the evaluation of the right-hand
side of (15.7.16) occurs in the function rofunc, with communication via global
(top-level) variables.

#include <math.h>
#include "nrutil.h"
int ndatat;
float *xt,*yt,aa,abdevt;

void medfit(float x[], float y[], int ndata, float *a, float *b, float *abdev)
/* Fits y = a + bx by the criterion of least absolute deviations. The arrays x[1..ndata] and
y[1..ndata] are the input experimental points. The fitted parameters a and b are output,
along with abdev, which is the mean absolute deviation (in y) of the experimental points from
the fitted line. This routine uses the routine rofunc, with communication via global variables. */
{
    float rofunc(float b);
    int j;
    float bb,b1,b2,del,f,f1,f2,sigb,temp;
    float sx=0.0,sy=0.0,sxy=0.0,sxx=0.0,chisq=0.0;

    ndatat=ndata;
    xt=x;
    yt=y;
    for (j=1;j<=ndata;j++) {        /* As a first guess for a and b, we will find the */
        sx += x[j];                 /* least-squares fitting line. */
        sy += y[j];
        sxy += x[j]*y[j];
        sxx += x[j]*x[j];
    }
    del=ndata*sxx-sx*sx;
    aa=(sxx*sy-sx*sxy)/del;         /* Least-squares solutions. */
    bb=(ndata*sxy-sx*sy)/del;
    for (j=1;j<=ndata;j++)
        chisq += (temp=y[j]-(aa+bb*x[j]),temp*temp);
    sigb=sqrt(chisq/del);           /* The standard deviation will give some idea of */
    b1=bb;                          /* how big an iteration step to take. */
    f1=rofunc(b1);
    if (sigb > 0.0) {
        b2=bb+SIGN(3.0*sigb,f1);
        /* Guess bracket as 3-sigma away, in the downhill direction known from f1. */
        f2=rofunc(b2);
        if (b2 == b1) {
            *a=aa;
            *b=bb;
            *abdev=abdevt/ndata;
            return;
        }
        while (f1*f2 > 0.0) {       /* Bracketing. */
            bb=b2+1.6*(b2-b1);
            b1=b2;
            f1=f2;
            b2=bb;
            f2=rofunc(b2);
        }
        sigb=0.01*sigb;             /* Refine until error a negligible number of standard */
        while (fabs(b2-b1) > sigb) {    /* deviations. */
            bb=b1+0.5*(b2-b1);      /* Bisection. */
            if (bb == b1 || bb == b2) break;
            f=rofunc(bb);
            if (f*f1 >= 0.0) {
                f1=f;
                b1=bb;
            } else {
                f2=f;
                b2=bb;
            }
        }
    }
    *a=aa;
    *b=bb;
    *abdev=abdevt/ndata;
}


#include <math.h>
#include "nrutil.h"
#define EPS 1.0e-7

extern int ndatat;                  /* Defined in medfit. */
extern float *xt,*yt,aa,abdevt;

float rofunc(float b)
/* Evaluates the right-hand side of equation (15.7.16) for a given value of b. Communication with
the routine medfit is through global variables. */
{
    float select(unsigned long k, unsigned long n, float arr[]);
    int j;
    float *arr,d,sum=0.0;

    arr=vector(1,ndatat);
    for (j=1;j<=ndatat;j++) arr[j]=yt[j]-b*xt[j];
    if (ndatat & 1) {
        aa=select((ndatat+1)>>1,ndatat,arr);
    }
    else {
        j=ndatat >> 1;
        aa=0.5*(select(j,ndatat,arr)+select(j+1,ndatat,arr));
    }
    abdevt=0.0;
    for (j=1;j<=ndatat;j++) {
        d=yt[j]-(b*xt[j]+aa);
        abdevt += fabs(d);
        if (yt[j] != 0.0) d /= fabs(yt[j]);
        if (fabs(d) > EPS) sum += (d >= 0.0 ? xt[j] : -xt[j]);
    }
    free_vector(arr,1,ndatat);
    return sum;
}


Other Robust Techniques 

Sometimes you may have a priori knowledge about the probable values and 
probable uncertainties of some parameters that you are trying to estimate from a data 
set. In such cases you may want to perform a fit that takes this advance information 
properly into account, neither completely freezing a parameter at a predetermined 
value (as in lfit, §15.4) nor completely leaving it to be determined by the data set.
The formalism for doing this is called “use of a priori covariances.” 

A related problem occurs in signal processing and control theory, where it is 
sometimes desired to “track” (i.e., maintain an estimate of) a time-varying signal in 
the presence of noise. If the signal is known to be characterized by some number 
of parameters that vary only slowly, then the formalism of Kalman filtering tells 
how the incoming, raw measurements of the signal should be processed to produce 
best parameter estimates as a function of time. For example, if the signal is a 
frequency-modulated sine wave, then the slowly varying parameter might be the 
instantaneous frequency. The Kalman filter for this case is called a phase-locked 
loop and is implemented in the circuitry of good radio receivers [3,4]. 

Chapter 16. Integration of Ordinary
Differential Equations


16.0 Introduction 

Problems involving ordinary differential equations (ODEs) can always be 
reduced to the study of sets of first-order differential equations. For example the 
second-order equation 


$$\frac{d^2 y}{dx^2} + q(x) \frac{dy}{dx} = r(x) \tag{16.0.1}$$

can be rewritten as two first-order equations

$$\begin{aligned} \frac{dy}{dx} &= z(x) \\ \frac{dz}{dx} &= r(x) - q(x) z(x) \end{aligned} \tag{16.0.2}$$


where z is a new variable. This exemplifies the procedure for an arbitrary ODE. The 
usual choice for the new variables is to let them be just derivatives of each other (and 
of the original variable). Occasionally, it is useful to incorporate into their definition 
some other factors in the equation, or some powers of the independent variable, 
for the purpose of mitigating singular behavior that could result in overflows or 
increased roundoff error. Let common sense be your guide: If you find that the 
original variables are smooth in a solution, while your auxiliary variables are doing 
crazy things, then figure out why and choose different auxiliary variables. 
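
In code, the reduction (16.0.2) amounts to writing one derivative routine for the pair of variables. The sketch below is illustrative only: the coefficient functions q_coef and r_coef are hypothetical stand-ins (here q(x) = 2 and r(x) = x), the name derivs_second_order is ours, and the arrays are zero-based rather than unit-offset.

```c
/* Hypothetical coefficient functions for y'' + q(x) y' = r(x). */
static double q_coef(double x) { (void)x; return 2.0; }
static double r_coef(double x) { return x; }

/* Right-hand sides of (16.0.2), with v[0] = y and v[1] = z = dy/dx. */
void derivs_second_order(double x, const double v[2], double dvdx[2])
{
    dvdx[0] = v[1];                             /* dy/dx = z(x)           */
    dvdx[1] = r_coef(x) - q_coef(x) * v[1];     /* dz/dx = r(x) - q(x) z  */
}
```

Any integrator for sets of first-order equations can then be applied to v without knowing that it came from a second-order problem.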

The generic problem in ordinary differential equations is thus reduced to the 
study of a set of N coupled first-order differential equations for the functions 
$y_i$, $i = 1, 2, \ldots, N$, having the general form

$$\frac{dy_i(x)}{dx} = f_i(x, y_1, \ldots, y_N), \qquad i = 1, \ldots, N \tag{16.0.3}$$

where the functions $f_i$ on the right-hand side are known.

A problem involving ODEs is not completely specified by its equations. Even 
more crucial in determining how to attack the problem numerically is the nature of 
the problem’s boundary conditions. Boundary conditions are algebraic conditions 
on the values of the functions $y_i$ in (16.0.3). In general they can be satisfied at
discrete specified points, but do not hold between those points, i.e., are not preserved
automatically by the differential equations. Boundary conditions can be as simple as 
requiring that certain variables have certain numerical values, or as complicated as 
a set of nonlinear algebraic equations among the variables. 

Usually, it is the nature of the boundary conditions that determines which 
numerical methods will be feasible. Boundary conditions divide into two broad 
categories. 

• In initial value problems all the $y_i$ are given at some starting value $x_s$, and
it is desired to find the $y_i$'s at some final point $x_f$, or at some discrete list
of points (for example, at tabulated intervals).

• In two-point boundary value problems, on the other hand, boundary
conditions are specified at more than one $x$. Typically, some of the
conditions will be specified at $x_s$ and the remainder at $x_f$.

This chapter will consider exclusively the initial value problem, deferring two- 
point boundary value problems, which are generally more difficult, to Chapter 17. 

The underlying idea of any routine for solving the initial value problem is always 
this: Rewrite the $dy$'s and $dx$'s in (16.0.3) as finite steps $\Delta y$ and $\Delta x$, and multiply the
equations by $\Delta x$. This gives algebraic formulas for the change in the functions when
the independent variable $x$ is “stepped” by one “stepsize” $\Delta x$. In the limit of making
the stepsize very small, a good approximation to the underlying differential equation 
is achieved. Literal implementation of this procedure results in Euler’s method 
(16.1.1, below), which is, however, not recommended for any practical use. Euler’s 
method is conceptually important, however; one way or another, practical methods all 
come down to this same idea: Add small increments to your functions corresponding 
to derivatives (right-hand sides of the equations) multiplied by stepsizes. 

In this chapter we consider three major types of practical numerical methods 
for solving initial value problems for ODEs: 

• Runge-Kutta methods 

• Richardson extrapolation and its particular implementation as the Bulirsch- 
Stoer method 

• predictor-corrector methods. 

A brief description of each of these types follows. 

1. Runge-Kutta methods propagate a solution over an interval by combining 
the information from several Euler-style steps (each involving one evaluation of the 
right-hand $f$'s), and then using the information obtained to match a Taylor series
expansion up to some higher order. 

2. Richardson extrapolation uses the powerful idea of extrapolating a computed 
result to the value that would have been obtained if the stepsize had been very 
much smaller than it actually was. In particular, extrapolation to zero stepsize is 
the desired goal. The first practical ODE integrator that implemented this idea was 
developed by Bulirsch and Stoer, and so extrapolation methods are often called 
Bulirsch-Stoer methods. 

3. Predictor-corrector methods store the solution along the way, and use 
those results to extrapolate the solution one step advanced; they then correct the 
extrapolation using derivative information at the new point. These are best for 
very smooth functions. 

Runge-Kutta is what you use when (i) you don’t know any better, or (ii) you 
have an intransigent problem where Bulirsch-Stoer is failing, or (iii) you have a trivial 
problem where computational efficiency is of no concern. Runge-Kutta succeeds
virtually always; but it is not usually fastest, except when evaluating $f_i$ is cheap and
moderate accuracy ($\lesssim 10^{-5}$) is required. Predictor-corrector methods, since they
use past information, are somewhat more difficult to start up, but, for many smooth 
problems, they are computationally more efficient than Runge-Kutta. In recent years 
Bulirsch-Stoer has been replacing predictor-corrector in many applications, but it 
is too soon to say that predictor-corrector is dominated in all cases. However, it 
appears that only rather sophisticated predictor-corrector routines are competitive. 
Accordingly, we have chosen not to give an implementation of predictor-corrector 
in this book. We discuss predictor-corrector further in §16.7, so that you can use 
a canned routine should you encounter a suitable problem. In our experience, the 
relatively simple Runge-Kutta and Bulirsch-Stoer routines we give are adequate 
for most problems. 

Each of the three types of methods can be organized to monitor internal 
consistency. This allows numerical errors which are inevitably introduced into 
the solution to be controlled by automatic (adaptive) changing of the fundamental
stepsize. We always recommend that adaptive stepsize control be implemented, 
and we will do so below. 

In general, all three types of methods can be applied to any initial value 
problem. Each comes with its own set of debits and credits that must be understood 
before it is used. 



We have organized the routines in this chapter into three nested levels. The 
lowest or “nitty-gritty” level is the piece we call the algorithm routine. This 
implements the basic formulas of the method, starts with dependent variables $y_i$ at
x, and calculates new values of the dependent variables at the value x + h. The 
algorithm routine also yields up some information about the quality of the solution 
after the step. The routine is dumb, however, and it is unable to make any adaptive 
decision about whether the solution is of acceptable quality or not. 

That quality-control decision we encode in a stepper routine. The stepper 
routine calls the algorithm routine. It may reject the result, set a smaller stepsize, and 
call the algorithm routine again, until compatibility with a predetermined accuracy 
criterion has been achieved. The stepper’s fundamental task is to take the largest 
stepsize consistent with specified performance. Only when this is accomplished does 
the true power of an algorithm come to light. 

Above the stepper is the driver routine, which starts and stops the integration, 
stores intermediate results, and generally acts as an interface with the user. There is 
nothing at all canonical about our driver routines. You should consider them to be 
examples, and you can customize them for your particular application. 

Of the routines that follow, rk4, rkck, mmid, stoerm, and simpr are algorithm 
routines; rkqs, bsstep, stiff, and stifbs are steppers; rkdumb and odeint 
are drivers. 

Section 16.6 of this chapter treats the subject of stiff equations, relevant both to 
ordinary differential equations and also to partial differential equations (Chapter 19). 

16.1 Runge-Kutta Method

The formula for the Euler method is 


$$y_{n+1} = y_n + h f(x_n, y_n) \tag{16.1.1}$$

which advances a solution from $x_n$ to $x_{n+1} \equiv x_n + h$. The formula is unsymmetrical:
It advances the solution through an interval $h$, but uses derivative information only
at the beginning of that interval (see Figure 16.1.1). That means (and you can verify
by expansion in power series) that the step's error is only one power of $h$ smaller
than the correction, i.e., $O(h^2)$ added to (16.1.1).

There are several reasons that Euler’s method is not recommended for practical 
use, among them, (i) the method is not very accurate when compared to other, 
fancier, methods run at the equivalent stepsize, and (ii) neither is it very stable 
(see §16.6 below). 
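
The accuracy claim is easy to check numerically. The following sketch is for a single equation only; the names euler_step, euler_integrate, and f_exp are hypothetical, and the test problem dy/dx = y with y(0) = 1 is an arbitrary choice made because its exact solution is known.

```c
/* One Euler step, eq. (16.1.1), for a single equation dy/dx = f(x, y). */
double euler_step(double (*f)(double, double), double x, double y, double h)
{
    return y + h * f(x, y);
}

/* Integrate from x0 to x1 in nstep equal Euler steps; returns y(x1). */
double euler_integrate(double (*f)(double, double), double x0, double y0,
                       double x1, int nstep)
{
    double h = (x1 - x0) / nstep, x = x0, y = y0;
    int k;
    for (k = 0; k < nstep; k++) {
        y = euler_step(f, x, y, h);
        x += h;
    }
    return y;
}

/* Test problem (an assumption for illustration): dy/dx = y. */
double f_exp(double x, double y)
{
    (void)x;
    return y;
}
```

Integrating f_exp from 0 to 1 with 100 steps and then with 200 steps, and comparing each result with e, one should find that the error is roughly halved when the stepsize is halved. That is the signature of a first-order method.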

Consider, however, the use of a step like (16.1.1) to take a “trial” step to the 
midpoint of the interval. Then use the value of both x and y at that midpoint 
to compute the “real” step across the whole interval. Figure 16.1.2 illustrates the 
idea. In equations, 


$$\begin{aligned} k_1 &= h f(x_n, y_n) \\ k_2 &= h f(x_n + \tfrac{1}{2} h, y_n + \tfrac{1}{2} k_1) \\ y_{n+1} &= y_n + k_2 + O(h^3) \end{aligned} \tag{16.1.2}$$


As indicated in the error term, this symmetrization cancels out the first-order error 
term, making the method second order. [A method is conventionally called $n$th
order if its error term is $O(h^{n+1})$.] In fact, (16.1.2) is called the second-order
Runge-Kutta or midpoint method. 
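
A sketch of one midpoint step for a single equation follows. As before, the names midpoint_step and f_growth are hypothetical, and dy/dx = y is chosen only because the exact answer is available for comparison.

```c
/* One second-order Runge-Kutta (midpoint) step, eq. (16.1.2),
   for a single equation dy/dx = f(x, y). */
double midpoint_step(double (*f)(double, double), double x, double y, double h)
{
    double k1 = h * f(x, y);                        /* trial step to the midpoint */
    double k2 = h * f(x + 0.5 * h, y + 0.5 * k1);   /* derivative at the midpoint */
    return y + k2;
}

/* Test problem (an assumption for illustration): dy/dx = y. */
double f_growth(double x, double y)
{
    (void)x;
    return y;
}
```

Starting from y = 1 at x = 0, a single step with h = 0.1 gives 1.105, against the exact value 1.10517... Halving h should reduce the one-step error by a factor of about 8, consistent with the $O(h^3)$ term in (16.1.2).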

Figure 16.1.1. Euler’s method. In this simplest (and least accurate) method for integrating an ODE,
the derivative at the starting point of each interval is extrapolated to find the next function value. The
method has first-order accuracy.

Figure 16.1.2. Midpoint method. Second-order accuracy is obtained by using the initial derivative at
each step to find a point halfway across the interval, then using the midpoint derivative across the full
width of the interval. In the figure, filled dots represent final function values, while open dots represent
function values that are discarded once their derivatives have been calculated and used.

We needn’t stop there. There are many ways to evaluate the right-hand side
$f(x, y)$ that all agree to first order, but that have different coefficients of higher-order
error terms. Adding up the right combination of these, we can eliminate the error
terms order by order. That is the basic idea of the Runge-Kutta method. Abramowitz
and Stegun [1], and Gear [2], give various specific formulas that derive from this basic
idea. By far the most often used is the classical fourth-order Runge-Kutta formula,
which has a certain sleekness of organization about it:

$$\begin{aligned} k_1 &= h f(x_n, y_n) \\ k_2 &= h f\left(x_n + \frac{h}{2}, y_n + \frac{k_1}{2}\right) \\ k_3 &= h f\left(x_n + \frac{h}{2}, y_n + \frac{k_2}{2}\right) \\ k_4 &= h f(x_n + h, y_n + k_3) \\ y_{n+1} &= y_n + \frac{k_1}{6} + \frac{k_2}{3} + \frac{k_3}{3} + \frac{k_4}{6} + O(h^5) \end{aligned} \tag{16.1.3}$$



The fourth-order Runge-Kutta method requires four evaluations of the right- 
hand side per step h (see Figure 16.1.3). This will be superior to the midpoint 
method (16.1.2) if at least twice as large a step is possible with (16.1.3) for the same 
accuracy. Is that so? The answer is: often, perhaps even usually, but surely not 
always! This takes us back to a central theme, namely that high order does not always 
mean high accuracy. The statement “fourth-order Runge-Kutta is generally superior 
to second-order” is a true one, but you should recognize it as a statement about the
contemporary practice of science rather than as a statement about strict mathematics.
That is, it reflects the nature of the problems that contemporary scientists like to solve.

Figure 16.1.3. Fourth-order Runge-Kutta method. In each step the derivative is evaluated four times:
once at the initial point, twice at trial midpoints, and once at a trial endpoint. From these derivatives the
final function value (shown as a filled dot) is calculated. (See text for details.)

For many scientific users, fourth-order Runge-Kutta is not just the first word on 
ODE integrators, but the last word as well. In fact, you can get pretty far on this old 
workhorse, especially if you combine it with an adaptive stepsize algorithm. Keep 
in mind, however, that the old workhorse’s last trip may well be to take you to the 
poorhouse: Bulirsch-Stoer or predictor-corrector methods can be very much more 
efficient for problems where very high accuracy is a requirement. Those methods 
are the high-strung racehorses. Runge-Kutta is for ploughing the fields. However, 
even the old workhorse is more nimble with new horseshoes. In §16.2 we will give 
a modern implementation of a Runge-Kutta method that is quite competitive as long 
as very high accuracy is not required. An excellent discussion of the pitfalls in 
constructing a good Runge-Kutta code is given in [3].

Here is the routine for carrying out one classical Runge-Kutta step on a set 
of n differential equations. You input the values of the dependent variables, and
you get out new values which are stepped by a stepsize h (which can be positive or 
negative). You will notice that the routine requires you to supply not only function 
derivs for calculating the right-hand side, but also values of the derivatives at the 
starting point. Why not let the routine call derivs for this first value? The answer 
will become clear only in the next section, but in brief is this: This call may not be 
your only one with these starting conditions. You may have taken a previous step 
with too large a stepsize, and this is your replacement. In that case, you do not 
want to call derivs unnecessarily at the start. Note that the routine that follows 
has, therefore, only three calls to derivs. 

#include "nrutil.h"

void rk4(float y[], float dydx[], int n, float x, float h, float yout[],
    void (*derivs)(float, float [], float []))
/* Given values for the variables y[1..n] and their derivatives dydx[1..n] known at x, use the
fourth-order Runge-Kutta method to advance the solution over an interval h and return the
incremented variables as yout[1..n], which need not be a distinct array from y. The user
supplies the routine derivs(x,y,dydx), which returns derivatives dydx at x. */
{
    int i;
    float xh,hh,h6,*dym,*dyt,*yt;

    dym=vector(1,n);
    dyt=vector(1,n);
    yt=vector(1,n);
    hh=h*0.5;
    h6=h/6.0;
    xh=x+hh;
    for (i=1;i<=n;i++) yt[i]=y[i]+hh*dydx[i];       /* First step. */
    (*derivs)(xh,yt,dyt);                           /* Second step. */
    for (i=1;i<=n;i++) yt[i]=y[i]+hh*dyt[i];
    (*derivs)(xh,yt,dym);                           /* Third step. */
    for (i=1;i<=n;i++) {
        yt[i]=y[i]+h*dym[i];
        dym[i] += dyt[i];
    }
    (*derivs)(x+h,yt,dyt);                          /* Fourth step. */
    for (i=1;i<=n;i++)                              /* Accumulate increments with proper weights. */
        yout[i]=y[i]+h6*(dydx[i]+dyt[i]+2.0*dym[i]);
    free_vector(yt,1,n);
    free_vector(dyt,1,n);
    free_vector(dym,1,n);
}


The Runge-Kutta method treats every step in a sequence of steps in identical 
manner. Prior behavior of a solution is not used in its propagation. This is 
mathematically proper, since any point along the trajectory of an ordinary differential 
equation can serve as an initial point. The fact that all steps are treated identically also 
makes it easy to incorporate Runge-Kutta into relatively simple “driver” schemes. 

We consider adaptive stepsize control, discussed in the next section, an essential 
for serious computing. Occasionally, however, you just want to tabulate a function at 
equally spaced intervals, and without particularly high accuracy. In the most common 
case, you want to produce a graph of the function. Then all you need may be a 
simple driver program that goes from an initial $x_s$ to a final $x_f$ in a specified number
of steps. To check accuracy, double the number of steps, repeat the integration, and 
compare results. This approach surely does not minimize computer time, and it can 
fail for problems whose nature requires a variable stepsize, but it may well minimize 
user effort. On small problems, this may be the paramount consideration. 
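
The accuracy check by doubling the number of steps can be sketched in a few lines. This illustration is deliberately independent of the routines in this section: it handles a single equation in double precision, and the names rk4_scalar, rk4_fixed, rk4_doubling_check, and f_growth are all hypothetical.

```c
#include <math.h>

/* One classical fourth-order Runge-Kutta step, eq. (16.1.3), scalar case. */
double rk4_scalar(double (*f)(double, double), double x, double y, double h)
{
    double k1 = h * f(x, y);
    double k2 = h * f(x + 0.5 * h, y + 0.5 * k1);
    double k3 = h * f(x + 0.5 * h, y + 0.5 * k2);
    double k4 = h * f(x + h, y + k3);
    return y + k1 / 6.0 + k2 / 3.0 + k3 / 3.0 + k4 / 6.0;
}

/* Advance from x1 to x2 in nstep equal steps; returns y(x2). */
double rk4_fixed(double (*f)(double, double), double x1, double y1,
                 double x2, int nstep)
{
    double h = (x2 - x1) / nstep, x = x1, y = y1;
    int k;
    for (k = 0; k < nstep; k++) {
        y = rk4_scalar(f, x, y, h);
        x += h;
    }
    return y;
}

/* Integrate with nstep and with 2*nstep steps. Returns the absolute
   difference of the two answers as a rough accuracy indicator, and
   stores the finer result in *ybest. */
double rk4_doubling_check(double (*f)(double, double), double x1, double y1,
                          double x2, int nstep, double *ybest)
{
    double ya = rk4_fixed(f, x1, y1, x2, nstep);
    double yb = rk4_fixed(f, x1, y1, x2, 2 * nstep);
    *ybest = yb;
    return fabs(yb - ya);
}

/* Test problem (an assumption for illustration): dy/dx = y. */
double f_growth(double x, double y)
{
    (void)x;
    return y;
}
```

If the returned difference is larger than you can tolerate, double nstep again and repeat. For a smooth problem the difference should shrink by roughly a factor of 16 each time.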

Here is such a driver, self-explanatory, which tabulates the integrated functions 
in the global arrays *xx and **y; be sure to allocate memory for them with the
routines vector() and matrix(), respectively.


#include "nrutil.h"

float **y,*xx;                      /* For communication back to main. */

void rkdumb(float vstart[], int nvar, float x1, float x2, int nstep,
    void (*derivs)(float, float [], float []))
/* Starting from initial values vstart[1..nvar] known at x1 use fourth-order Runge-Kutta
to advance nstep equal increments to x2. The user-supplied routine derivs(x,v,dvdx)
evaluates derivatives. Results are stored in the global variables y[1..nvar][1..nstep+1]
and xx[1..nstep+1]. */
{
    void rk4(float y[], float dydx[], int n, float x, float h, float yout[],
        void (*derivs)(float, float [], float []));
    int i,k;
    float x,h;
    float *v,*vout,*dv;

    v=vector(1,nvar);
    vout=vector(1,nvar);
    dv=vector(1,nvar);
    for (i=1;i<=nvar;i++) {         /* Load starting values. */
        v[i]=vstart[i];
        y[i][1]=v[i];
    }
    xx[1]=x1;
    x=x1;
    h=(x2-x1)/nstep;
    for (k=1;k<=nstep;k++) {        /* Take nstep steps. */
        (*derivs)(x,v,dv);
        rk4(v,dv,nvar,x,h,vout,derivs);
        if ((float)(x+h) == x) nrerror("Step size too small in routine rkdumb");
        x += h;
        xx[k+1]=x;                  /* Store intermediate steps. */
        for (i=1;i<=nvar;i++) {
            v[i]=vout[i];
            y[i][k+1]=v[i];
        }
    }
    free_vector(dv,1,nvar);
    free_vector(vout,1,nvar);
    free_vector(v,1,nvar);
}


CITED REFERENCES AND FURTHER READING: 

Abramowitz, M., and Stegun, I.A. 1964, Handbook of Mathematical Functions, Applied Mathematics
Series, Volume 55 (Washington: National Bureau of Standards; reprinted 1968 by
Dover Publications, New York), §25.5. [1]

Gear, C.W. 1971, Numerical Initial Value Problems in Ordinary Differential Equations (Englewood
Cliffs, NJ: Prentice-Hall), Chapter 2. [2]

Shampine, L.F., and Watts, H.A. 1977, in Mathematical Software III, J.R. Rice, ed. (New York: Academic
Press), pp. 257-275; 1979, Applied Mathematics and Computation, vol. 5, pp. 93-121. [3]

Rice, J.R. 1983, Numerical Methods, Software, and Analysis (New York: McGraw-Hill), §9.2.


16.2 Adaptive Stepsize Control for Runge-Kutta 

A good ODE integrator should exert some adaptive control over its own progress, 
making frequent changes in its stepsize. Usually the purpose of this adaptive stepsize 
control is to achieve some predetermined accuracy in the solution with minimum 
computational effort. Many small steps should tiptoe through treacherous terrain, 
while a few great strides should speed through smooth uninteresting countryside. 
The resulting gains in efficiency are not mere tens of percents or factors of two; 
they can sometimes be factors of ten, a hundred, or more. Sometimes accuracy 
may be demanded not directly in the solution itself, but in some related conserved 
quantity that can be monitored. 

Implementation of adaptive stepsize control requires that the stepping algorithm 
signal information about its performance, most important, an estimate of its truncation 
error. In this section we will learn how such information can be obtained. Obviously, 
the calculation of this information will add to the computational overhead, but the
investment will generally be repaid handsomely. 

With fourth-order Runge-Kutta, the most straightforward technique by far is 
step doubling (see, e.g., [1]). We take each step twice, once as a full step, then, 
independently, as two half steps (see Figure 16.2.1). How much overhead is this, 
say in terms of the number of evaluations of the right-hand sides? Each of the three 
separate Runge-Kutta steps in the procedure requires 4 evaluations, but the single 
and double sequences share a starting point, so the total is 11. This is to be compared 
not to 4, but to 8 (the two half-steps), since — stepsize control aside — we are 
achieving the accuracy of the smaller (half) stepsize. The overhead cost is therefore 
a factor 1.375. What does it buy us? 

Let us denote the exact solution for an advance from $x$ to $x + 2h$ by $y(x + 2h)$
and the two approximate solutions by $y_1$ (one step $2h$) and $y_2$ (2 steps each of size
$h$). Since the basic method is fourth order, the true solution and the two numerical
approximations are related by

$$
\begin{aligned}
y(x + 2h) &= y_1 + (2h)^5 \phi + O(h^6) + \cdots \\
y(x + 2h) &= y_2 + 2(h^5) \phi + O(h^6) + \cdots
\end{aligned}
\tag{16.2.1}
$$

where, to order $h^5$, the value $\phi$ remains constant over the step. [Taylor series
expansion tells us that $\phi$ is a number whose order of magnitude is $y^{(5)}(x)/5!$.] The
first expression in (16.2.1) involves $(2h)^5$ since the stepsize is $2h$, while the second
expression involves $2(h^5)$ since the error on each step is $h^5 \phi$. The difference between
the two numerical estimates is a convenient indicator of truncation error

$$\Delta \equiv y_2 - y_1 \tag{16.2.2}$$

It is this difference that we shall endeavor to keep to a desired degree of accuracy, 
neither too large nor too small. We do this by adjusting h. 

It might also occur to you that, ignoring terms of order $h^6$ and higher, we can
solve the two equations in (16.2.1) to improve our numerical estimate of the true
solution $y(x + 2h)$, namely,

$$y(x + 2h) = y_2 + \frac{\Delta}{15} + O(h^6) \tag{16.2.3}$$

This estimate is accurate to fifth order, one order higher than the original
Runge-Kutta steps. However, we can’t have our cake and eat it: (16.2.3) may be fifth-order
accurate, but we have no way of monitoring its truncation error. Higher order is
not always higher accuracy! Use of (16.2.3) rarely does harm, but we have no
way of directly knowing whether it is doing any good. Therefore we should use
$\Delta$ as the error estimate and take as “gravy” any additional accuracy gain derived
from (16.2.3). In the technical literature, use of a procedure like (16.2.3) is called
“local extrapolation.”
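The step-doubling bookkeeping described above can be sketched for a single scalar equation. This is an illustration only, not one of the book's routines: it uses double precision, and the names rhs_fn, rk4_step, step_double, and decay are hypothetical. It returns $y_2$, the error indicator $\Delta$ of (16.2.2), and the locally extrapolated value of (16.2.3), using 11 right-hand side evaluations per double step.

```c
#include <assert.h>
#include <math.h>

/* Hypothetical right-hand side type for a scalar ODE y' = f(x, y). */
typedef double (*rhs_fn)(double x, double y);

/* One classical fourth-order Runge-Kutta step of size h.
   dydx must already hold f(x, y), so this costs 3 further evaluations. */
static double rk4_step(rhs_fn f, double x, double y, double dydx, double h)
{
    double k1 = h * dydx;
    double k2 = h * f(x + 0.5 * h, y + 0.5 * k1);
    double k3 = h * f(x + 0.5 * h, y + 0.5 * k2);
    double k4 = h * f(x + h, y + k3);
    return y + (k1 + 2.0 * k2 + 2.0 * k3 + k4) / 6.0;
}

/* Advance from x to x + 2h by step doubling.
   Returns y2 (two steps of size h). On output *delta = y2 - y1 (16.2.2)
   and *yextrap = y2 + delta/15 (16.2.3).
   Evaluations of f: 1 shared + 3 (big step) + 3 + 4 (small steps) = 11. */
static double step_double(rhs_fn f, double x, double y, double h,
                          double *delta, double *yextrap)
{
    assert(h != 0.0);
    double d0 = f(x, y);                          /* shared starting derivative */
    double y1 = rk4_step(f, x, y, d0, 2.0 * h);   /* one big step */
    double ymid = rk4_step(f, x, y, d0, h);       /* first half step */
    double y2 = rk4_step(f, x + h, ymid, f(x + h, ymid), h);  /* second half step */
    *delta = y2 - y1;
    *yextrap = y2 + *delta / 15.0;
    return y2;
}

/* Example right-hand side: y' = -y, whose exact solution is exp(-x). */
static double decay(double x, double y)
{
    (void)x;
    return -y;
}
```

Calling step_double(decay, 0.0, 1.0, 0.1, &delta, &yextrap) advances the test problem from 0 to 0.2. For this smooth problem the extrapolated value lands closer to exp(-0.2) than $y_2$ does, but as the text cautions, $\Delta$ measures the error of $y_2$, not of the extrapolated value.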

Figure 16.2.1. Step-doubling as a means for adaptive stepsize control in fourth-order Runge-Kutta.
Points where the derivative is evaluated are shown as filled circles. The open circle represents the same
derivatives as the filled circle immediately above it, so the total number of evaluations is 11 per two steps.
Comparing the accuracy of the big step with the two small steps gives a criterion for adjusting the stepsize
on the next step, or for rejecting the current step as inaccurate. (Figure labels: big step; two small
steps; horizontal axis x.)

An alternative stepsize adjustment algorithm is based on the embedded Runge-Kutta
formulas, originally invented by Fehlberg. An interesting fact about Runge-Kutta
formulas is that for orders M higher than four, more than M function
evaluations (though never more than M + 2) are required. This accounts for the
popularity of the classical fourth-order method: It seems to give the most bang
for the buck. However, Fehlberg discovered a fifth-order method with six function
evaluations where another combination of the six functions gives a fourth-order
method. The difference between the two estimates of y(x + h) can then be used as
an estimate of the truncation error to adjust the stepsize. Since Fehlberg’s original
formula, several other embedded Runge-Kutta formulas have been found.

Many practitioners were at one time wary of the robustness of Runge-Kutta- 
Fehlberg methods. The feeling was that using the same evaluation points to advance 
the function and to estimate the error was riskier than step-doubling, where the error 
estimate is based on independent function evaluations. However, experience has 
shown that this concern is not a problem in practice. Accordingly, embedded Runge- 
Kutta formulas, which are roughly a factor of two more efficient, have superseded 
algorithms based on step-doubling. 

The general form of a fifth-order Runge-Kutta formula is

$$
\begin{aligned}
k_1 &= h f(x_n, y_n) \\
k_2 &= h f(x_n + a_2 h, y_n + b_{21} k_1) \\
&\;\;\vdots \\
k_6 &= h f(x_n + a_6 h, y_n + b_{61} k_1 + \cdots + b_{65} k_5) \\
y_{n+1} &= y_n + c_1 k_1 + c_2 k_2 + c_3 k_3 + c_4 k_4 + c_5 k_5 + c_6 k_6 + O(h^6)
\end{aligned}
\tag{16.2.4}
$$

The embedded fourth-order formula is

$$y^*_{n+1} = y_n + c^*_1 k_1 + c^*_2 k_2 + c^*_3 k_3 + c^*_4 k_4 + c^*_5 k_5 + c^*_6 k_6 + O(h^5) \tag{16.2.5}$$

and so the error estimate is

$$\Delta \equiv y_{n+1} - y^*_{n+1} = \sum_{i=1}^{6} (c_i - c^*_i) k_i \tag{16.2.6}$$

The particular values of the various constants that we favor are those found by Cash 
and Karp [2], and given in the accompanying table. These give a more efficient 
method than Fehlberg’s original values, with somewhat better error properties. 
Cash-Karp Parameters for Embedded Runge-Kutta Method

i    a_i     b_i1         b_i2      b_i3        b_i4           b_i5       c_i         c*_i
1                                                                         37/378      2825/27648
2    1/5     1/5                                                          0           0
3    3/10    3/40         9/40                                            250/621     18575/48384
4    3/5     3/10         -9/10     6/5                                   125/594     13525/55296
5    1       -11/54       5/2       -70/27      35/27                     0           277/14336
6    7/8     1631/55296   175/512   575/13824   44275/110592   253/4096   512/1771    1/4


Now that we know, at least approximately, what our error is, we need to
consider how to keep it within desired bounds. What is the relation between $\Delta$
and $h$? According to (16.2.4)-(16.2.5), $\Delta$ scales as $h^5$. If we take a step $h_1$
and produce an error $\Delta_1$, therefore, the step $h_0$ that would have given some other
value $\Delta_0$ is readily estimated as

$$h_0 = h_1 \left| \frac{\Delta_0}{\Delta_1} \right|^{0.2} \tag{16.2.7}$$

Henceforth we will let $\Delta_0$ denote the desired accuracy. Then equation (16.2.7) is
used in two ways: If $\Delta_1$ is larger than $\Delta_0$ in magnitude, the equation tells how
much to decrease the stepsize when we retry the present (failed) step. If $\Delta_1$ is
smaller than $\Delta_0$, on the other hand, then the equation tells how much we can safely
increase the stepsize for the next step. Local extrapolation consists in accepting
the fifth-order value $y_{n+1}$, even though the error estimate actually applies to the
fourth-order value $y^*_{n+1}$.

Our notation hides the fact that $\Delta_0$ is actually a vector of desired accuracies,
one for each equation in the set of ODEs. In general, our accuracy requirement will
be that all equations are within their respective allowed errors. In other words, we
will rescale the stepsize according to the needs of the “worst-offender” equation.

How is $\Delta_0$, the desired accuracy, related to some looser prescription like “get a
solution good to one part in $10^6$”? That can be a subtle question, and it depends on
exactly what your application is! You may be dealing with a set of equations whose
dependent variables differ enormously in magnitude. In that case, you probably
want to use fractional errors, $\Delta_0 = \epsilon y$, where $\epsilon$ is a number like $10^{-6}$ or whatever.
On the other hand, you may have oscillatory functions that pass through zero but
are bounded by some maximum values. In that case you probably want to set $\Delta_0$
equal to $\epsilon$ times those maximum values.

A convenient way to fold these considerations into a generally useful stepper
routine is this: One of the arguments of the routine will of course be the vector of
dependent variables at the beginning of a proposed step. Call that y[1..n]. Let
us require the user to specify for each step another, corresponding, vector argument
yscal[1..n], and also an overall tolerance level eps. Then the desired accuracy
for the ith equation will be taken to be

$$\Delta_0 = \mathtt{eps} \times \mathtt{yscal[i]} \tag{16.2.8}$$

If you desire constant fractional errors, plug a pointer to y into the pointer to yscal
calling slot (no need to copy the values into a different array). If you desire constant
absolute errors relative to some maximum values, set the elements of yscal equal to
those maximum values. A useful “trick” for getting constant fractional errors except
“very” near zero crossings is to set yscal[i] equal to |y[i]| + |h × dydx[i]|.
(The routine odeint, below, does this.)
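The scaling “trick” and the worst-offender test can be sketched as two small helpers. These are hypothetical illustrations, not routines from the book: the names fill_yscal and scaled_error are invented here, and the arrays are zero-based and double precision rather than the unit-offset float vectors used in the listings.

```c
#include <assert.h>
#include <math.h>

/* Set yscal[i] = |y[i]| + |h*dydx[i]| + tiny for i = 0..n-1.
   The small constant keeps yscal nonzero when y and dydx both vanish. */
static void fill_yscal(const double y[], const double dydx[], int n,
                       double h, double yscal[])
{
    const double tiny = 1.0e-30;
    assert(n > 0);
    for (int i = 0; i < n; i++)
        yscal[i] = fabs(y[i]) + fabs(h * dydx[i]) + tiny;
}

/* Return the largest ratio |yerr[i]| / (eps * yscal[i]) over all equations.
   A value <= 1 means every equation meets its desired accuracy (16.2.8). */
static double scaled_error(const double yerr[], const double yscal[],
                           int n, double eps)
{
    double errmax = 0.0;
    assert(n > 0 && eps > 0.0);
    for (int i = 0; i < n; i++) {
        double ratio = fabs(yerr[i] / yscal[i]) / eps;
        if (ratio > errmax) errmax = ratio;
    }
    return errmax;
}
```

For pure fractional errors one would pass y itself in place of yscal; for absolute errors, an array of the expected maximum magnitudes.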

Here is a more technical point. We have to consider one additional possibility
for yscal. The error criteria mentioned thus far are “local,” in that they bound the
error of each step individually. In some applications you may be unusually sensitive
about a “global” accumulation of errors, from beginning to end of the integration
and in the worst possible case where the errors all are presumed to add with the
same sign. Then, the smaller the stepsize $h$, the smaller the value $\Delta_0$ that you will
need to impose. Why? Because there will be more steps between your starting
and ending values of $x$. In such cases you will want to set yscal proportional to
$h$, typically to something like

$$\Delta_0 = \epsilon h \times \mathtt{dydx[i]} \tag{16.2.9}$$

This enforces fractional accuracy $\epsilon$ not on the values of $y$ but (much more stringently)
on the increments to those values at each step. But now look back at (16.2.7). If $\Delta_0$
has an implicit scaling with $h$, then the exponent 0.20 is no longer correct: When
the stepsize is reduced from a too-large value, the new predicted value $h_1$ will fail to
meet the desired accuracy when yscal is also altered to this new $h_1$ value. Instead
of 0.20 = 1/5, we must scale by the exponent 0.25 = 1/4 for things to work out.
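The origin of the exponent 1/4 can be made explicit with a short worked derivation. The constants $C$ and $K$ below are introduced only for this illustration: $C$ is the leading error coefficient of the step and $K$ collects $\epsilon$ and the derivative in (16.2.9).

```latex
\begin{aligned}
\Delta_1 &= C h_1^5, \qquad \Delta_0(h) = K h
  && \text{(error of the step; desired accuracy scaling with } h\text{)} \\
C h_0^5 &= K h_0
  && \text{(require the new step } h_0 \text{ to just meet its own target)} \\
h_0^4 &= \frac{K}{C} = h_1^4 \, \frac{K h_1}{C h_1^5}
       = h_1^4 \, \frac{\Delta_0(h_1)}{\Delta_1} \\
h_0 &= h_1 \left| \frac{\Delta_0(h_1)}{\Delta_1} \right|^{1/4}
\end{aligned}
```

When $\Delta_0$ does not depend on $h$, the same argument with $C h_0^5 = \Delta_0$ recovers the exponent 1/5 of (16.2.7).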

The exponents 0.20 and 0.25 are not really very different. This motivates us
to adopt the following pragmatic approach, one that frees us from having to know
in advance whether or not you, the user, plan to scale your yscal’s with stepsize.
Whenever we decrease a stepsize, let us use the larger value of the exponent (whether
we need it or not!), and whenever we increase a stepsize, let us use the smaller
exponent. Furthermore, because our estimates of error are not exact, but only
accurate to the leading order in $h$, we are advised to put in a safety factor $S$ which is
a few percent smaller than unity. Equation (16.2.7) is thus replaced by

$$
h_0 =
\begin{cases}
S h_1 \left| \dfrac{\Delta_0}{\Delta_1} \right|^{0.20} & \Delta_0 \ge \Delta_1 \\[2ex]
S h_1 \left| \dfrac{\Delta_0}{\Delta_1} \right|^{0.25} & \Delta_0 < \Delta_1
\end{cases}
\tag{16.2.10}
$$


We have found this prescription to be a reliable one in practice. 

Here, then, is a stepper program that takes one “quality-controlled” Runge- 
Kutta step. 
#include <math.h>
#include "nrutil.h"
#define SAFETY 0.9
#define PGROW -0.2
#define PSHRNK -0.25
#define ERRCON 1.89e-4
/* The value ERRCON equals (5/SAFETY) raised to the power (1/PGROW), see use below. */

void rkqs(float y[], float dydx[], int n, float *x, float htry, float eps,
    float yscal[], float *hdid, float *hnext,
    void (*derivs)(float, float [], float []))
/* Fifth-order Runge-Kutta step with monitoring of local truncation error to ensure accuracy and
adjust stepsize. Input are the dependent variable vector y[1..n] and its derivative dydx[1..n]
at the starting value of the independent variable x. Also input are the stepsize to be attempted
htry, the required accuracy eps, and the vector yscal[1..n] against which the error is
scaled. On output, y and x are replaced by their new values, hdid is the stepsize that was
actually accomplished, and hnext is the estimated next stepsize. derivs is the user-supplied
routine that computes the right-hand side derivatives. */
{
    void rkck(float y[], float dydx[], int n, float x, float h,
        float yout[], float yerr[], void (*derivs)(float, float [], float []));
    int i;
    float errmax,h,htemp,xnew,*yerr,*ytemp;

    yerr=vector(1,n);
    ytemp=vector(1,n);
    h=htry;                                     /* Set stepsize to the initial trial value. */
    for (;;) {
        rkck(y,dydx,n,*x,h,ytemp,yerr,derivs);  /* Take a step. */
        errmax=0.0;                             /* Evaluate accuracy. */
        for (i=1;i<=n;i++) errmax=FMAX(errmax,fabs(yerr[i]/yscal[i]));
        errmax /= eps;                          /* Scale relative to required tolerance. */
        if (errmax <= 1.0) break;               /* Step succeeded. Compute size of next step. */
        htemp=SAFETY*h*pow(errmax,PSHRNK);
        /* Truncation error too large, reduce stepsize. */
        h=(h >= 0.0 ? FMAX(htemp,0.1*h) : FMIN(htemp,0.1*h));
        /* No more than a factor of 10. */
        xnew=(*x)+h;
        if (xnew == *x) nrerror("stepsize underflow in rkqs");
    }
    if (errmax > ERRCON) *hnext=SAFETY*h*pow(errmax,PGROW);
    else *hnext=5.0*h;                          /* No more than a factor of 5 increase. */
    *x += (*hdid=h);
    for (i=1;i<=n;i++) y[i]=ytemp[i];
    free_vector(ytemp,1,n);
    free_vector(yerr,1,n);
}

The routine rkqs calls the routine rkck to take a Cash-Karp Runge-Kutta step: 


#include "nrutil.h"

void rkck(float y[], float dydx[], int n, float x, float h, float yout[],
    float yerr[], void (*derivs)(float, float [], float []))
/* Given values for n variables y[1..n] and their derivatives dydx[1..n] known at x, use
the fifth-order Cash-Karp Runge-Kutta method to advance the solution over an interval h
and return the incremented variables as yout[1..n]. Also return an estimate of the local
truncation error in yout using the embedded fourth-order method. The user supplies the routine
derivs(x,y,dydx), which returns derivatives dydx at x. */
{
    int i;
    static float a2=0.2,a3=0.3,a4=0.6,a5=1.0,a6=0.875,b21=0.2,
        b31=3.0/40.0,b32=9.0/40.0,b41=0.3,b42 = -0.9,b43=1.2,
        b51 = -11.0/54.0,b52=2.5,b53 = -70.0/27.0,b54=35.0/27.0,
        b61=1631.0/55296.0,b62=175.0/512.0,b63=575.0/13824.0,
        b64=44275.0/110592.0,b65=253.0/4096.0,c1=37.0/378.0,
        c3=250.0/621.0,c4=125.0/594.0,c6=512.0/1771.0,
        dc5 = -277.00/14336.0;
    float dc1=c1-2825.0/27648.0,dc3=c3-18575.0/48384.0,
        dc4=c4-13525.0/55296.0,dc6=c6-0.25;
    float *ak2,*ak3,*ak4,*ak5,*ak6,*ytemp;

    ak2=vector(1,n);
    ak3=vector(1,n);
    ak4=vector(1,n);
    ak5=vector(1,n);
    ak6=vector(1,n);
    ytemp=vector(1,n);
    for (i=1;i<=n;i++)                          /* First step. */
        ytemp[i]=y[i]+b21*h*dydx[i];
    (*derivs)(x+a2*h,ytemp,ak2);                /* Second step. */
    for (i=1;i<=n;i++)
        ytemp[i]=y[i]+h*(b31*dydx[i]+b32*ak2[i]);
    (*derivs)(x+a3*h,ytemp,ak3);                /* Third step. */
    for (i=1;i<=n;i++)
        ytemp[i]=y[i]+h*(b41*dydx[i]+b42*ak2[i]+b43*ak3[i]);
    (*derivs)(x+a4*h,ytemp,ak4);                /* Fourth step. */
    for (i=1;i<=n;i++)
        ytemp[i]=y[i]+h*(b51*dydx[i]+b52*ak2[i]+b53*ak3[i]+b54*ak4[i]);
    (*derivs)(x+a5*h,ytemp,ak5);                /* Fifth step. */
    for (i=1;i<=n;i++)
        ytemp[i]=y[i]+h*(b61*dydx[i]+b62*ak2[i]+b63*ak3[i]+b64*ak4[i]+b65*ak5[i]);
    (*derivs)(x+a6*h,ytemp,ak6);                /* Sixth step. */
    for (i=1;i<=n;i++)                          /* Accumulate increments with proper weights. */
        yout[i]=y[i]+h*(c1*dydx[i]+c3*ak3[i]+c4*ak4[i]+c6*ak6[i]);
    for (i=1;i<=n;i++)
        yerr[i]=h*(dc1*dydx[i]+dc3*ak3[i]+dc4*ak4[i]+dc5*ak5[i]+dc6*ak6[i]);
    /* Estimate error as difference between fourth and fifth order methods. */
    free_vector(ytemp,1,n);
    free_vector(ak6,1,n);
    free_vector(ak5,1,n);
    free_vector(ak4,1,n);
    free_vector(ak3,1,n);
    free_vector(ak2,1,n);
}

Noting that the above routines are all in single precision, don’t be too greedy in 
specifying eps. The punishment for excessive greediness is interesting and worthy of 
Gilbert and Sullivan’s Mikado: The routine can always achieve an apparent zero error 
by making the stepsize so small that quantities of order hy' add to quantities of order 
y as if they were zero. Then the routine chugs happily along taking infinitely many 
infinitesimal steps and never changing the dependent variables one iota. (You guard 
against this catastrophic loss of your computer budget by signaling on abnormally 
small stepsizes or on the dependent variable vector remaining unchanged from step 
to step. On a personal workstation you guard against it by not taking too long a 
lunch hour while your program is running.) 

Here is a full-fledged “driver” for Runge-Kutta with adaptive stepsize control.
We warmly recommend this routine, or one like it, for a variety of problems, notably
including garden-variety ODEs or sets of ODEs, and definite integrals (augmenting
the methods of Chapter 4). For storage of intermediate results (if you desire to
inspect them) we assume that the top-level pointer references *xp and **yp have been
validly initialized (e.g., by the utilities vector() and matrix()). Because steps
occur at unequal intervals, results are only stored at intervals greater than dxsav. The
top-level variable kmax indicates the maximum number of steps that can be stored.
If kmax=0 there is no intermediate storage, and the pointers *xp and **yp need not
point to valid memory. Storage of steps stops if kmax is exceeded, except that the
ending values are always stored. Again, these controls are merely indicative of what
you might need. The routine odeint should be customized to the problem at hand.

#include <math.h>
#include "nrutil.h"
#define MAXSTP 10000
#define TINY 1.0e-30

extern int kmax,kount;
extern float *xp,**yp,dxsav;
/* User storage for intermediate results. Preset kmax and dxsav in the calling program. If kmax != 0
results are stored at approximate intervals dxsav in the arrays xp[1..kount], yp[1..nvar]
[1..kount], where kount is output by odeint. Defining declarations for these variables, with
memory allocations xp[1..kmax] and yp[1..nvar][1..kmax] for the arrays, should be in
the calling program. */

void odeint(float ystart[], int nvar, float x1, float x2, float eps, float h1,
    float hmin, int *nok, int *nbad,
    void (*derivs)(float, float [], float []),
    void (*rkqs)(float [], float [], int, float *, float, float, float [],
    float *, float *, void (*)(float, float [], float [])))
/* Runge-Kutta driver with adaptive stepsize control. Integrate starting values ystart[1..nvar]
from x1 to x2 with accuracy eps, storing intermediate results in global variables. h1 should
be set as a guessed first stepsize, hmin as the minimum allowed stepsize (can be zero). On
output nok and nbad are the number of good and bad (but retried and fixed) steps taken, and
ystart is replaced by values at the end of the integration interval. derivs is the user-supplied
routine for calculating the right-hand side derivative, while rkqs is the name of the stepper
routine to be used. */
{
    int nstp,i;
    float xsav,x,hnext,hdid,h;
    float *yscal,*y,*dydx;

    yscal=vector(1,nvar);
    y=vector(1,nvar);
    dydx=vector(1,nvar);
    x=x1;
    h=SIGN(h1,x2-x1);
    *nok = (*nbad) = kount = 0;
    for (i=1;i<=nvar;i++) y[i]=ystart[i];
    if (kmax > 0) xsav=x-dxsav*2.0;             /* Assures storage of first step. */
    for (nstp=1;nstp<=MAXSTP;nstp++) {          /* Take at most MAXSTP steps. */
        (*derivs)(x,y,dydx);
        for (i=1;i<=nvar;i++)
            /* Scaling used to monitor accuracy. This general-purpose choice can be modified
            if need be. */
            yscal[i]=fabs(y[i])+fabs(dydx[i]*h)+TINY;
        if (kmax > 0 && kount < kmax-1 && fabs(x-xsav) > fabs(dxsav)) {
            xp[++kount]=x;                      /* Store intermediate results. */
            for (i=1;i<=nvar;i++) yp[i][kount]=y[i];
            xsav=x;
        }
        if ((x+h-x2)*(x+h-x1) > 0.0) h=x2-x;    /* If stepsize can overshoot, decrease. */
        (*rkqs)(y,dydx,nvar,&x,h,eps,yscal,&hdid,&hnext,derivs);
        if (hdid == h) ++(*nok); else ++(*nbad);
        if ((x-x2)*(x2-x1) >= 0.0) {            /* Are we done? */
            for (i=1;i<=nvar;i++) ystart[i]=y[i];
            if (kmax) {
                xp[++kount]=x;                  /* Save final step. */
                for (i=1;i<=nvar;i++) yp[i][kount]=y[i];
            }
            free_vector(dydx,1,nvar);
            free_vector(y,1,nvar);
            free_vector(yscal,1,nvar);
            return;                             /* Normal exit. */
        }
        if (fabs(hnext) <= hmin) nrerror("Step size too small in odeint");
        h=hnext;
    }
    nrerror("Too many steps in routine odeint");
}


CITED REFERENCES AND FURTHER READING: 

Gear, C.W. 1971, Numerical Initial Value Problems in Ordinary Differential Equations (Englewood 
Cliffs, NJ: Prentice-Hall). [1] 

Cash, J.R., and Karp, A.H. 1990, ACM Transactions on Mathematical Software, vol. 16,
pp. 201-222. [2]

Shampine, L.F., and Watts, H.A. 1977, in Mathematical Software III, J.R. Rice, ed. (New York:
Academic Press), pp. 257-275; 1979, Applied Mathematics and Computation, vol. 5,
pp. 93-121.

Forsythe, G.E., Malcolm, M.A., and Moler, C.B. 1977, Computer Methods for Mathematical 
Computations (Englewood Cliffs, NJ: Prentice-Hall). 


16.3 Modified Midpoint Method 

This section discusses the modified midpoint method, which advances a vector 
of dependent variables y(x) from a point x to a point x + H by a sequence of n 
substeps each of size h, 

h = H/n (16.3.1) 

In principle, one could use the modified midpoint method in its own right as an ODE 
integrator. In practice, the method finds its most important application as a part of 
the more powerful Bulirsch-Stoer technique, treated in §16.4. You can therefore 
consider this section as a preamble to §16.4. 

The number of right-hand side evaluations required by the modified midpoint
method is $n + 1$. The formulas for the method are

$$
\begin{aligned}
z_0 &\equiv y(x) \\
z_1 &= z_0 + h f(x, z_0) \\
z_{m+1} &= z_{m-1} + 2h f(x + mh, z_m) \quad \text{for } m = 1, 2, \ldots, n-1 \\
y(x + H) &\approx y_n \equiv \tfrac{1}{2} \left[ z_n + z_{n-1} + h f(x + H, z_n) \right]
\end{aligned}
\tag{16.3.2}
$$

Here the $z$’s are intermediate approximations which march along in steps of $h$, while
$y_n$ is the final approximation to $y(x + H)$. The method is basically a “centered
difference” or “midpoint” method (compare equation 16.1.2), except at the first and
last points. Those give the qualifier “modified.”

The modified midpoint method is a second-order method, like (16.1.2), but with 
the advantage of requiring (asymptotically for large n) only one derivative evaluation 
per step h instead of the two required by second-order Runge-Kutta. Perhaps there 
are applications where the simplicity of (16.3.2), easily coded in-line in some other 
program, recommends it. In general, however, use of the modified midpoint method 
by itself will be dominated by the embedded Runge-Kutta method with adaptive 
stepsize control, as implemented in the preceding section. 

The usefulness of the modified midpoint method to the Bulirsch-Stoer technique 
(§16.4) derives from a “deep” result about equations (16.3.2), due to Gragg. It turns 
out that the error of (16.3.2), expressed as a power series in h, the stepsize, contains 
only even powers of h. 


$$y_n - y(x + H) = \sum_{i=1}^{\infty} \alpha_i h^{2i} \tag{16.3.3}$$

where H is held constant, but h changes by varying n in (16.3.1). The importance 
of this even power series is that, if we play our usual tricks of combining steps to 
knock out higher-order error terms, we can gain two orders at a time! 

For example, suppose $n$ is even, and let $y_{n/2}$ denote the result of applying
(16.3.1) and (16.3.2) with half as many steps, $n \to n/2$. Then the estimate

$$y(x + H) \approx \frac{4 y_n - y_{n/2}}{3} \tag{16.3.4}$$

is fourth-order accurate, the same as fourth-order Runge-Kutta, but requires only 
about 1.5 derivative evaluations per step h instead of Runge-Kutta’s 4 evaluations. 
Don’t be too anxious to implement (16.3.4), since we will soon do even better. 
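Formulas (16.3.2) and (16.3.4) can be tried out on a scalar equation with the following sketch. It is an illustration only, in double precision; the names rhs_fn, modified_midpoint, midpoint_extrapolated, and decay are hypothetical. Note that it recomputes f(x, y) in each call, which a production routine would avoid by passing the starting derivative in.

```c
#include <assert.h>
#include <math.h>

typedef double (*rhs_fn)(double x, double y);   /* scalar y' = f(x, y) */

/* Modified midpoint method (16.3.2): advance y from x to x + H with
   n substeps of size h = H/n, using n + 1 evaluations of f. */
static double modified_midpoint(rhs_fn f, double x, double y, double H, int n)
{
    assert(n >= 1);
    double h = H / n;
    double zprev = y;                    /* z_{m-1}, initially z_0 */
    double zcur = y + h * f(x, y);       /* z_m, initially z_1 */
    for (int m = 1; m < n; m++) {
        double znext = zprev + 2.0 * h * f(x + m * h, zcur);   /* z_{m+1} */
        zprev = zcur;
        zcur = znext;
    }
    /* Final smoothing step gives y_n. */
    return 0.5 * (zcur + zprev + h * f(x + H, zcur));
}

/* Fourth-order combination (16.3.4) of the n-step and n/2-step results.
   n must be even. */
static double midpoint_extrapolated(rhs_fn f, double x, double y,
                                    double H, int n)
{
    assert(n >= 2 && n % 2 == 0);
    double yn = modified_midpoint(f, x, y, H, n);
    double yhalf = modified_midpoint(f, x, y, H, n / 2);
    return (4.0 * yn - yhalf) / 3.0;
}

/* Example right-hand side: y' = -y, exact solution exp(-x). */
static double decay(double x, double y)
{
    (void)x;
    return -y;
}
```

For y' = -y over H = 1 with n = 8, the combined estimate lies noticeably closer to exp(-1) than the plain eight-substep result, as the cancellation of the leading $h^2$ term suggests.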

Now would be a good time to look back at the routine qsimp in §4.2, and 
especially to compare equation (4.2.4) with equation (16.3.4) above. You will see 
that the transition in Chapter 4 to the idea of Richardson extrapolation, as embodied 
in Romberg integration of §4.3, is exactly analogous to the transition in going from 
this section to the next one. 

Here is the routine that implements the modified midpoint method, which will 
be used below. 

#include "nrutil.h"

void mmid(float y[], float dydx[], int nvar, float xs, float htot, int nstep,
    float yout[], void (*derivs)(float, float [], float []))
/* Modified midpoint step. At xs, input the dependent variable vector y[1..nvar] and its
derivative vector dydx[1..nvar]. Also input is htot, the total step to be made, and nstep, the
number of substeps to be used. The output is returned as yout[1..nvar], which need not
be a distinct array from y; if it is distinct, however, then y and dydx are returned undamaged. */
{
    int n,i;
    float x,swap,h2,h,*ym,*yn;

    ym=vector(1,nvar);
    yn=vector(1,nvar);
    h=htot/nstep;                               /* Stepsize this trip. */
    for (i=1;i<=nvar;i++) {
        ym[i]=y[i];
        yn[i]=y[i]+h*dydx[i];                   /* First step. */
    }
    x=xs+h;
    (*derivs)(x,yn,yout);                       /* Will use yout for temporary storage of derivatives. */
    h2=2.0*h;
    for (n=2;n<=nstep;n++) {                    /* General step. */
        for (i=1;i<=nvar;i++) {
            swap=ym[i]+h2*yout[i];
            ym[i]=yn[i];
            yn[i]=swap;
        }
        x += h;
        (*derivs)(x,yn,yout);
    }
    for (i=1;i<=nvar;i++)                       /* Last step. */
        yout[i]=0.5*(ym[i]+yn[i]+h*yout[i]);
    free_vector(yn,1,nvar);
    free_vector(ym,1,nvar);
}

CITED REFERENCES AND FURTHER READING: 

Gear, C.W. 1971, Numerical Initial Value Problems in Ordinary Differential Equations (Englewood 
Cliffs, NJ: Prentice-Hall), §6.1.4. 

Stoer, J., and Bulirsch, R. 1980, Introduction to Numerical Analysis (New York: Springer-Verlag), 
§7.2.12. 


16.4 Richardson Extrapolation and the Bulirsch-Stoer Method

The techniques described in this section are not for differential equations 
containing nonsmooth functions. For example, you might have a differential 
equation whose right-hand side involves a function that is evaluated by table look-up 
and interpolation. If so, go back to Runge-Kutta with adaptive stepsize choice: 
That method does an excellent job of feeling its way through rocky or discontinuous 
terrain. It is also an excellent choice for quick-and-dirty, low-accuracy solution 
of a set of equations. A second warning is that the techniques in this section are 
not particularly good for differential equations that have singular points inside the 
interval of integration. A regular solution must tiptoe very carefully across such 
points. Runge-Kutta with adaptive stepsize can sometimes effect this; more generally, 
there are special techniques available for such problems, beyond our scope here. 

Apart from those two caveats, we believe that the Bulirsch-Stoer method, 
discussed in this section, is the best known way to obtain high-accuracy solutions 
to ordinary differential equations with minimal computational effort. (A possible 
exception, infrequently encountered in practice, is discussed in §16.7.) 
Figure 16.4.1. Richardson extrapolation as used in the Bulirsch-Stoer method. A large interval H is
spanned by different sequences of finer and finer substeps. Their results are extrapolated to an answer 
that is supposed to correspond to infinitely fine substeps. In the Bulirsch-Stoer method, the integrations 
are done by the modified midpoint method, and the extrapolation technique is rational function or 
polynomial extrapolation. 


Three key ideas are involved. The first is Richardson’s deferred approach 
to the limit, which we already met in §4.3 on Romberg integration. The idea is 
to consider the final answer of a numerical calculation as itself being an analytic 
function (if a complicated one) of an adjustable parameter like the stepsize h. That 
analytic function can be probed by performing the calculation with various values 
of h, none of them being necessarily small enough to yield the accuracy that we 
desire. When we know enough about the function, we fit it to some analytic form, 
and then evaluate it at that mythical and golden point h = 0 (see Figure 16.4.1). 
Richardson extrapolation is a method for turning straw into gold! (Lead into gold 
for alchemist readers.) 

The second idea has to do with what kind of fitting function is used. Bulirsch and 
Stoer first recognized the strength of rational function extrapolation in Richardson- 
type applications. That strength is to break the shackles of the power series and its 
limited radius of convergence, out only to the distance of the first pole in the complex 
plane. Rational function fits can remain good approximations to analytic functions 
even after the various terms in powers of h all have comparable magnitudes. In 
other words, h can be so large as to make the whole notion of the “order” of the 
method meaningless — and the method can still work superbly. Nevertheless, more 
recent experience suggests that for smooth problems straightforward polynomial 
extrapolation is slightly more efficient than rational function extrapolation. We will 
accordingly adopt polynomial extrapolation as the default, but the routine bsstep 
below allows easy substitution of one kind of extrapolation for the other. You 
might wish at this point to review §3.1—§3.2, where polynomial and rational function 
extrapolation were already discussed. 

The third idea was discussed in the section before this one, namely to use 
a method whose error function is strictly even, allowing the rational function or 
polynomial approximation to be in terms of the variable h 2 instead of just h. 

Put these ideas together and you have the Bulirsch-Stoer method [1]. A single
Bulirsch-Stoer step takes us from x to x + H, where H is supposed to be quite a large
(not at all infinitesimal) distance. That single step is a grand leap consisting
of many (e.g., dozens to hundreds) substeps of the modified midpoint method, which
are then extrapolated to zero stepsize.

The sequence of separate attempts to cross the interval H is made with increasing 
values of n, the number of substeps. Bulirsch and Stoer originally proposed the 
sequence 

$$n = 2, 4, 6, 8, 12, 16, 24, 32, 48, 64, 96, \ldots, [n_j = 2 n_{j-2}], \ldots \tag{16.4.1}$$

More recent work by Deuflhard [2,3] suggests that the sequence 

$$n = 2, 4, 6, 8, 10, 12, 14, \ldots, [n_j = 2j], \ldots \tag{16.4.2}$$

is usually more efficient. For each step, we do not know in advance how far up this 
sequence we will go. After each successive n is tried, a polynomial extrapolation 
is attempted. That extrapolation gives both extrapolated values and error estimates. 
If the errors are not satisfactory, we go higher in n. If they are satisfactory, we go 
on to the next step and begin anew with n = 2. 

Of course there must be some upper limit, beyond which we conclude that there 
is some obstacle in our path in the interval H, so that we must reduce H rather than 
just subdivide it more finely. In the implementations below, the maximum number 
of n’s to be tried is called KMAXX. For reasons described below we usually take this 
equal to 8; the 8th value of the sequence (16.4.2) is 16, so this is the maximum
number of subdivisions of H that we allow. 
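The two substep sequences are easy to generate. The sketch below uses hypothetical helper names (bs_sequence and deuflhard_sequence) and zero-based arrays, so that nseq[j-1] holds the text's $n_j$.

```c
#include <assert.h>

/* Original Bulirsch-Stoer sequence (16.4.1): 2, 4, 6, then n_j = 2*n_{j-2}. */
static void bs_sequence(int nseq[], int count)
{
    assert(count >= 0);
    for (int j = 0; j < count; j++) {
        if (j < 3)
            nseq[j] = 2 * (j + 1);          /* 2, 4, 6 */
        else
            nseq[j] = 2 * nseq[j - 2];      /* 8, 12, 16, 24, ... */
    }
}

/* Deuflhard's even sequence (16.4.2): n_j = 2j. */
static void deuflhard_sequence(int nseq[], int count)
{
    assert(count >= 0);
    for (int j = 0; j < count; j++)
        nseq[j] = 2 * (j + 1);
}
```

With count = 8, deuflhard_sequence ends at 16, which is the maximum number of subdivisions quoted above for KMAXX = 8.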

We enforce error control, as in the Runge-Kutta method, by monitoring internal 
consistency, and adapting stepsize to match a prescribed bound on the local truncation 
error. Each new result from the sequence of modified midpoint integrations allows a 
tableau like that in §3.1 to be extended by one additional set of diagonals. The size of 
the new correction added at each stage is taken as the (conservative) error estimate. 
How should we use this error estimate to adjust the stepsize? The best strategy now 
known is due to Deuflhard [2,3]. For completeness we describe it here: 

Suppose the absolute value of the error estimate returned from the $k$th column (and hence
the $(k+1)$st row) of the extrapolation tableau is $\epsilon_{k+1,k}$. Error control is enforced by requiring

$$\epsilon_{k+1,k} < \epsilon \tag{16.4.3}$$

as the criterion for accepting the current step, where $\epsilon$ is the required tolerance. For the even
sequence (16.4.2) the order of the method is $2k + 1$:

$$\epsilon_{k+1,k} \sim H^{2k+1} \tag{16.4.4}$$

Thus a simple estimate of a new stepsize $H_k$ to obtain convergence in a fixed column $k$ would be

$$H_k = H \left( \frac{\epsilon}{\epsilon_{k+1,k}} \right)^{1/(2k+1)} \tag{16.4.5}$$

Which column $k$ should we aim to achieve convergence in? Let’s compare the work
required for different $k$. Suppose $A_k$ is the work to obtain row $k$ of the extrapolation tableau,
so $A_{k+1}$ is the work to obtain column $k$. We will assume the work is dominated by the cost
of evaluating the functions defining the right-hand sides of the differential equations. For $n_k$
subdivisions in $H$, the number of function evaluations can be found from the recurrence

$$\begin{aligned} A_1 &= n_1 + 1 \\ A_{k+1} &= A_k + n_{k+1} \end{aligned} \tag{16.4.6}$$

The work per unit step to get column $k$ is $A_{k+1}/H_k$, which we nondimensionalize with a
factor of $H$ and write as

$$W_k = \frac{A_{k+1}}{H_k} H \tag{16.4.7}$$

$$\phantom{W_k} = A_{k+1} \left( \frac{\epsilon_{k+1,k}}{\epsilon} \right)^{1/(2k+1)} \tag{16.4.8}$$



The quantities $W_k$ can be calculated during the integration. The optimal column index $q$
is then defined by

$$W_q = \min_{k=1,\ldots,k_f} W_k \tag{16.4.9}$$

where $k_f$ is the final column, in which the error criterion (16.4.3) was satisfied. The $q$
determined from (16.4.9) defines the stepsize $H_q$ to be used as the next basic stepsize, so that
we can expect to get convergence in the optimal column $q$.

Two important refinements have to be made to the strategy outlined so far: 

• If the current $H$ is “too small,” then $k_f$ will be “too small,” and so $q$ remains
“too small.” It may be desirable to increase $H$ and aim for convergence in a
column $q > k_f$.

• If the current $H$ is “too big,” we may not converge at all on the current step and we
will have to decrease $H$. We would like to detect this by monitoring the quantities
$\epsilon_{k+1,k}$ for each $k$ so we can stop the current step as soon as possible.

Deuflhard’s prescription for dealing with these two problems uses ideas from communication
theory to determine the “average expected convergence behavior” of the extrapolation.
His model produces certain correction factors $\alpha(k,q)$ by which $H_k$ is to be multiplied to try
to get convergence in column $q$. The factors $\alpha(k,q)$ depend only on $\epsilon$ and the sequence $\{n_i\}$
and so can be computed once during initialization:

$$\alpha(k,q) = \epsilon^{\frac{A_{k+1} - A_{q+1}}{(2k+1)(A_{q+1} - A_1 + 1)}} \qquad \text{for } k < q \tag{16.4.10}$$

with $\alpha(q,q) = 1$.

Now to handle the first problem, suppose convergence occurs in column $q = k_f$. Then
rather than taking $H_q$ for the next step, we might aim to increase the stepsize to get convergence
in column $q + 1$. Since we don’t have $H_{q+1}$ available from the computation, we estimate it as

$$H_{q+1} = H_q \, \alpha(q, q+1) \tag{16.4.11}$$

By equation (16.4.7) this replacement is efficient, i.e., reduces the work per unit step, if

$$\frac{A_{q+1}}{H_q} > \frac{A_{q+2}}{H_{q+1}} \tag{16.4.12}$$

or

$$A_{q+1} \, \alpha(q, q+1) > A_{q+2} \tag{16.4.13}$$

During initialization, this inequality can be checked for $q = 1, 2, \ldots$ to determine $k_{\max}$, the
largest allowed column. Then when (16.4.12) is satisfied it will always be efficient to use
$H_{q+1}$. (In practice we limit $k_{\max}$ to 8 even when $\epsilon$ is very small as there is very little further
gain in efficiency whereas roundoff can become a problem.)

The problem of stepsize reduction is handled by computing stepsize estimates

$$\bar{H}_k = H_k \, \alpha(k,q), \qquad k = 1, \ldots, q-1 \tag{16.4.14}$$

during the current step. The $\bar{H}$’s are estimates of the stepsize to get convergence in the optimal
column $q$. If any $\bar{H}_k$ is “too small,” we abandon the current step and restart using $\bar{H}_k$. The
criterion of being “too small” is taken to be

$$H_k \, \alpha(k, q+1) < H \tag{16.4.15}$$

The $\alpha$’s satisfy $\alpha(k, q+1) > \alpha(k,q)$.

During the first step, when we have no information about the solution, the stepsize
reduction check is made for all $k$. Afterwards, we test for convergence and for possible
stepsize reduction only in an “order window”

$$\max(1, q-1) \le k \le \min(k_{\max}, q+1) \tag{16.4.16}$$

The rationale for the order window is that if convergence appears to occur for $k < q - 1$ it
is often spurious, resulting from some fortuitously small error estimate in the extrapolation.
On the other hand, if you need to go beyond $k = q + 1$ to obtain convergence, your local
model of the convergence behavior is obviously not very good and you need to cut the
stepsize and reestablish it.

In the routine bsstep, these various tests are actually carried out using quantities

$$\epsilon(k) \equiv \frac{H}{H_k} = \left( \frac{\epsilon_{k+1,k}}{\epsilon} \right)^{1/(2k+1)} \tag{16.4.17}$$

called err[k] in the code. As usual, we include a “safety factor” in the stepsize selection.
This is implemented by replacing $\epsilon$ by $0.25\epsilon$. Other safety factors are explained in the
program comments.

Note that while the optimal convergence column is restricted to increase by at most one 
on each step, a sudden drop in order is allowed by equation (16.4.9). This gives the method 
a degree of robustness for problems with discontinuities. 


Let us remind you once again that scaling of the variables is often crucial for 
successful integration of differential equations. The scaling “trick” suggested in 
the discussion following equation (16.2.8) is a good general purpose choice, but 
not foolproof. Scaling by the maximum values of the variables is more robust, but 
requires you to have some prior information. 

The following implementation of a Bulirsch-Stoer step has exactly the same 
calling sequence as the quality-controlled Runge-Kutta stepper rkqs. This means 
that the driver odeint in §16.2 can be used for Bulirsch-Stoer as well as Runge- 
Kutta: Just substitute bsstep for rkqs in odeint’s argument list. The routine 
bsstep calls mmid to take the modified midpoint sequences, and calls pzextr, given 
below, to do the polynomial extrapolation. 


#include <math.h>
#include "nrutil.h"
#define KMAXX 8            /* Maximum row number used in the extrapolation. */
#define IMAXX (KMAXX+1)
#define SAFE1 0.25         /* Safety factors. */
#define SAFE2 0.7
#define REDMAX 1.0e-5      /* Maximum factor for stepsize reduction. */
#define REDMIN 0.7         /* Minimum factor for stepsize reduction. */
#define TINY 1.0e-30       /* Prevents division by zero. */
#define SCALMX 0.1         /* 1/SCALMX is the maximum factor by which a
                              stepsize can be increased. */

float **d,*x;              /* Pointers to matrix and vector used by pzextr or rzextr. */

void bsstep(float y[], float dydx[], int nv, float *xx, float htry, float eps,
    float yscal[], float *hdid, float *hnext,
    void (*derivs)(float, float [], float []))
/* Bulirsch-Stoer step with monitoring of local truncation error to ensure accuracy and adjust
stepsize. Input are the dependent variable vector y[1..nv] and its derivative dydx[1..nv]
at the starting value of the independent variable x. Also input are the stepsize to be attempted
htry, the required accuracy eps, and the vector yscal[1..nv] against which the error is
scaled. On output, y and x are replaced by their new values, hdid is the stepsize that was
actually accomplished, and hnext is the estimated next stepsize. derivs is the user-supplied
routine that computes the right-hand side derivatives. Be sure to set htry on successive steps
to the value of hnext returned from the previous step, as is the case if the routine is called
by odeint. */

{
    void mmid(float y[], float dydx[], int nvar, float xs, float htot,
        int nstep, float yout[], void (*derivs)(float, float[], float[]));
    void pzextr(int iest, float xest, float yest[], float yz[], float dy[],
        int nv);
    int i,iq,k,kk,km;
    static int first=1,kmax,kopt;
    static float epsold = -1.0,xnew;
    float eps1,errmax,fact,h,red,scale,work,wrkmin,xest;
    float *err,*yerr,*ysav,*yseq;
    static float a[IMAXX+1];
    static float alf[KMAXX+1][KMAXX+1];
    static int nseq[IMAXX+1]={0,2,4,6,8,10,12,14,16,18};
    int reduct,exitflag=0;

    d=matrix(1,nv,1,KMAXX);
    err=vector(1,KMAXX);
    x=vector(1,KMAXX);
    yerr=vector(1,nv);
    ysav=vector(1,nv);
    yseq=vector(1,nv);
    if (eps != epsold) {                    /* A new tolerance, so reinitialize. */
        *hnext = xnew = -1.0e29;            /* "Impossible" values. */
        eps1=SAFE1*eps;
        a[1]=nseq[1]+1;                     /* Compute work coefficients A_k. */
        for (k=1;k<=KMAXX;k++) a[k+1]=a[k]+nseq[k+1];
        for (iq=2;iq<=KMAXX;iq++) {         /* Compute alpha(k,q). */
            for (k=1;k<iq;k++)
                alf[k][iq]=pow(eps1,(a[k+1]-a[iq+1])/
                    ((a[iq+1]-a[1]+1.0)*(2*k+1)));
        }
        epsold=eps;
        for (kopt=2;kopt<KMAXX;kopt++)      /* Determine optimal row number for */
            if (a[kopt+1] > a[kopt]*alf[kopt-1][kopt]) break;   /* convergence. */
        kmax=kopt;
    }
    h=htry;
    for (i=1;i<=nv;i++) ysav[i]=y[i];       /* Save the starting values. */
    if (*xx != xnew || h != (*hnext)) {     /* A new stepsize or a new integration: */
        first=1;                            /* re-establish the order window. */
        kopt=kmax;
    }
    reduct=0;
    for (;;) {
        for (k=1;k<=kmax;k++) {             /* Evaluate the sequence of modified */
            xnew=(*xx)+h;                   /* midpoint integrations. */
            if (xnew == (*xx)) nrerror("step size underflow in bsstep");
            mmid(ysav,dydx,nv,*xx,h,nseq[k],yseq,derivs);
            xest=SQR(h/nseq[k]);            /* Squared, since error series is even. */
            pzextr(k,xest,yseq,y,yerr,nv);  /* Perform extrapolation. */
            if (k != 1) {                   /* Compute normalized error estimate epsilon(k). */
                errmax=TINY;
                for (i=1;i<=nv;i++) errmax=FMAX(errmax,fabs(yerr[i]/yscal[i]));
                errmax /= eps;              /* Scale error relative to tolerance. */
                km=k-1;
                err[km]=pow(errmax/SAFE1,1.0/(2*km+1));
            }
            if (k != 1 && (k >= kopt-1 || first)) {     /* In order window. */
                if (errmax < 1.0) {         /* Converged. */
                    exitflag=1;
                    break;

The polynomial extrapolation routine is based on the same algorithm as polint
§3.1. It is simpler in that it is always extrapolating to zero, rather than to an arbitrary
value. However, it is more complicated in that it must individually extrapolate each
component of a vector of quantities.

#include "nrutil.h"

extern float **d,*x;       /* Defined in bsstep. */

void pzextr(int iest, float xest, float yest[], float yz[], float dy[], int nv)
/* Use polynomial extrapolation to evaluate nv functions at x = 0 by fitting a polynomial to a
sequence of estimates with progressively smaller values x = xest, and corresponding function
vectors yest[1..nv]. This call is number iest in the sequence of calls. Extrapolated function
values are output as yz[1..nv], and their estimated error is output as dy[1..nv]. */
{
    int k1,j;
    float q,f2,f1,delta,*c;

    c=vector(1,nv);
    x[iest]=xest;                           /* Save current independent variable. */
    for (j=1;j<=nv;j++) dy[j]=yz[j]=yest[j];
    if (iest == 1) {                        /* Store first estimate in first column. */
        for (j=1;j<=nv;j++) d[j][1]=yest[j];
    } else {
        for (j=1;j<=nv;j++) c[j]=yest[j];
        for (k1=1;k1<iest;k1++) {
            delta=1.0/(x[iest-k1]-xest);
            f1=xest*delta;
            f2=x[iest-k1]*delta;
            for (j=1;j<=nv;j++) {           /* Propagate tableau 1 diagonal more. */
                q=d[j][k1];
                d[j][k1]=dy[j];
                delta=c[j]-q;
                dy[j]=f1*delta;
                c[j]=f2*delta;
                yz[j] += dy[j];
            }
        }
        for (j=1;j<=nv;j++) d[j][iest]=dy[j];
    }
    free_vector(c,1,nv);
}


Current wisdom favors polynomial extrapolation over rational function extrapolation
in the Bulirsch-Stoer method. However, our feeling is that this view is guided
more by the kinds of problems used for tests than by one method being actually 
“better.” Accordingly, we provide the optional routine rzextr for rational function 
extrapolation, an exact substitution for pzextr above. 

#include "nrutil.h"

extern float **d,*x;       /* Defined in bsstep. */

void rzextr(int iest, float xest, float yest[], float yz[], float dy[], int nv)
/* Exact substitute for pzextr, but uses diagonal rational function extrapolation instead of
polynomial extrapolation. */
{
    int k,j;
    float yy,v,ddy,c,b1,b,*fx;

    fx=vector(1,iest);
    x[iest]=xest;                           /* Save current independent variable. */
    if (iest == 1)
        for (j=1;j<=nv;j++) {
            yz[j]=yest[j];
            d[j][1]=yest[j];
            dy[j]=yest[j];
        }
    else {
        for (k=1;k<iest;k++)
            fx[k+1]=x[iest-k]/xest;
        for (j=1;j<=nv;j++) {               /* Evaluate next diagonal in tableau. */
            v=d[j][1];
            d[j][1]=yy=c=yest[j];
            for (k=2;k<=iest;k++) {
                b1=fx[k]*v;
                b=b1-c;
                if (b) {
                    b=(c-v)/b;
                    ddy=c*b;
                    c=b1*b;
                } else                      /* Care needed to avoid division by 0. */
                    ddy=v;
                if (k != iest) v=d[j][k];
                d[j][k]=ddy;
                yy += ddy;
            }
            dy[j]=ddy;
            yz[j]=yy;
        }
    }
    free_vector(fx,1,iest);
}


CITED REFERENCES AND FURTHER READING: 

Stoer, J., and Bulirsch, R. 1980, Introduction to Numerical Analysis (New York: Springer-Verlag), 
§7.2.14. [1] 

Gear, C.W. 1971, Numerical Initial Value Problems in Ordinary Differential Equations (Englewood 
Cliffs, NJ: Prentice-Hall), §6.2. 

Deuflhard, P. 1983, Numerische Mathematik, vol. 41, pp. 399-422. [2] 

Deuflhard, P. 1985, SIAM Review, vol. 27, pp. 505-535. [3]


16.5 Second-Order Conservative Equations 


Usually when you have a system of high-order differential equations to solve it is best 
to reformulate them as a system of first-order equations, as discussed in §16.0. There is 
a particular class of equations that occurs quite frequently in practice where you can gain 
about a factor of two in efficiency by differencing the equations directly. The equations are 
second-order systems where the derivative does not appear on the right-hand side:

$$y'' = f(x, y), \qquad y(x_0) = y_0, \qquad y'(x_0) = z_0 \tag{16.5.1}$$

As usual, $y$ can denote a vector of values.

Stoermer’s rule, dating back to 1907, has been a popular method for discretizing such
systems. With $h = H/m$ we have

$$\begin{aligned}
y_1 &= y_0 + h\left[z_0 + \tfrac{1}{2} h f(x_0, y_0)\right] \\
y_{k+1} - 2y_k + y_{k-1} &= h^2 f(x_0 + kh, y_k), \qquad k = 1, \ldots, m-1 \\
z_m &= (y_m - y_{m-1})/h + \tfrac{1}{2} h f(x_0 + H, y_m)
\end{aligned} \tag{16.5.2}$$

Here $z_m$ is $y'(x_0 + H)$. Henrici showed how to rewrite equations (16.5.2) to reduce roundoff
error by using the quantities $\Delta_k \equiv y_{k+1} - y_k$. Start with

$$\begin{aligned}
\Delta_0 &= h\left[z_0 + \tfrac{1}{2} h f(x_0, y_0)\right] \\
y_1 &= y_0 + \Delta_0
\end{aligned} \tag{16.5.3}$$

Then for $k = 1, \ldots, m-1$, set

$$\begin{aligned}
\Delta_k &= \Delta_{k-1} + h^2 f(x_0 + kh, y_k) \\
y_{k+1} &= y_k + \Delta_k
\end{aligned} \tag{16.5.4}$$

Finally compute the derivative from

$$z_m = \Delta_{m-1}/h + \tfrac{1}{2} h f(x_0 + H, y_m) \tag{16.5.5}$$

Gragg again showed that the error series for equations (16.5.3)-(16.5.5) contains only
even powers of $h$, and so the method is a logical candidate for extrapolation à la Bulirsch-Stoer.
We replace mmid by the following routine stoerm:


#include "nrutil.h"

void stoerm(float y[], float d2y[], int nv, float xs, float htot, int nstep,
    float yout[], void (*derivs)(float, float [], float []))
/* Stoermer's rule for integrating y'' = f(x,y) for a system of n = nv/2 equations. On input
y[1..nv] contains y in its first n elements and y' in its second n elements, all evaluated at
xs. d2y[1..nv] contains the right-hand side function f (also evaluated at xs) in its first n
elements. Its second n elements are not referenced. Also input is htot, the total step to be
taken, and nstep, the number of substeps to be used. The output is returned as yout[1..nv],
with the same storage arrangement as y. derivs is the user-supplied routine that calculates f. */
{
    int i,n,neqns,nn;
    float h,h2,halfh,x,*ytemp;

    ytemp=vector(1,nv);
    h=htot/nstep;                           /* Stepsize this trip. */
    halfh=0.5*h;
    neqns=nv/2;                             /* Number of equations. */
    for (i=1;i<=neqns;i++) {                /* First step. */
        n=neqns+i;
        ytemp[i]=y[i]+(ytemp[n]=h*(y[n]+halfh*d2y[i]));
    }
    x=xs+h;
    (*derivs)(x,ytemp,yout);                /* Use yout for temporary storage of derivatives. */
    h2=h*h;
    for (nn=2;nn<=nstep;nn++) {             /* General step. */
        for (i=1;i<=neqns;i++)
            ytemp[i] += (ytemp[(n=neqns+i)] += h2*yout[i]);
        x += h;
        (*derivs)(x,ytemp,yout);
    }
    for (i=1;i<=neqns;i++) {                /* Last step. */
        n=neqns+i;
        yout[n]=ytemp[n]/h+halfh*yout[i];
        yout[i]=ytemp[i];
    }
    free_vector(ytemp,1,nv);
}






Note that for compatibility with bsstep the arrays y and d2y are of length $2n$ for a
system of $n$ second-order equations. The values of $y$ are stored in the first $n$ elements of y,
while the first derivatives are stored in the second $n$ elements. The right-hand side $f$ is stored
in the first $n$ elements of the array d2y; the second $n$ elements are unused. With this storage
arrangement you can use bsstep simply by replacing the call to mmid with one to stoerm
using the same arguments; just be sure that the argument nv of bsstep is set to $2n$. You
should also use the more efficient sequence of stepsizes suggested by Deuflhard:

$$n = 1, 2, 3, 4, 5, \ldots \tag{16.5.6}$$

and set KMAXX = 12 in bsstep.
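
To see equations (16.5.3)-(16.5.5) in isolation, here is a minimal sketch for a single equation in double precision with ordinary C indexing. The function stoermer_scalar and the test right-hand side harmonic_rhs are illustrative names of our own; the sketch is not a substitute for stoerm, which handles vectors and the storage layout expected by bsstep.

```c
/* One pass of Stoermer's rule in Henrici's form, eqs. (16.5.3)-(16.5.5), for a
   single equation y'' = f(x, y). Advances (y, z = y') from x0 to x0 + H
   using m >= 1 substeps. */
void stoermer_scalar(double (*f)(double, double), double x0, double H, int m,
                     double *y, double *z)
{
    double h = H / m;
    double delta = h * (*z + 0.5 * h * f(x0, *y));   /* Delta_0 */
    double yk = *y + delta;                          /* y_1 */

    for (int k = 1; k < m; k++) {
        delta += h * h * f(x0 + k * h, yk);          /* Delta_k */
        yk += delta;                                 /* y_{k+1} */
    }
    *z = delta / h + 0.5 * h * f(x0 + H, yk);        /* z_m */
    *y = yk;
}

/* Example right-hand side: the harmonic oscillator y'' = -y. */
double harmonic_rhs(double x, double y)
{
    (void)x;                                         /* no explicit x dependence */
    return -y;
}
```

Starting from $y = 0$, $z = 1$ and integrating the harmonic oscillator over $H = 1$ with many substeps should reproduce $\sin 1$ and $\cos 1$ to roughly second order in $h$.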


CITED REFERENCES AND FURTHER READING: 
Deuflhard, P. 1985, SIAM Review, vol. 27, pp. 505-535.


16.6 Stiff Sets of Equations 

As soon as one deals with more than one first-order differential equation, the 
possibility of a stiff set of equations arises. Stiffness occurs in a problem where 
there are two or more very different scales of the independent variable on which 
the dependent variables are changing. For example, consider the following set 
of equations [1]:

$$\begin{aligned} u' &= 998u + 1998v \\ v' &= -999u - 1999v \end{aligned} \tag{16.6.1}$$

with boundary conditions

$$u(0) = 1, \qquad v(0) = 0 \tag{16.6.2}$$

By means of the transformation

$$u = 2y - z, \qquad v = -y + z \tag{16.6.3}$$

we find the solution

$$\begin{aligned} u &= 2e^{-x} - e^{-1000x} \\ v &= -e^{-x} + e^{-1000x} \end{aligned} \tag{16.6.4}$$


If we integrated the system (16.6.1) with any of the methods given so far in this
chapter, the presence of the $e^{-1000x}$ term would require a stepsize $h \ll 1/1000$ for
the method to be stable (the reason for this is explained below). This is so even
though the $e^{-1000x}$ term is completely negligible in determining the values of $u$ and
$v$ as soon as one is away from the origin (see Figure 16.6.1).

Figure 16.6.1. Example of an instability encountered in integrating a stiff equation (schematic). Here
it is supposed that the equation has two solutions, shown as solid and dashed lines. Although the initial
conditions are such as to give the solid solution, the stability of the integration (shown as the unstable
dotted sequence of segments) is determined by the more rapidly varying dashed solution, even after that
solution has effectively died away to zero. Implicit integration methods are the cure.

This is the generic disease of stiff equations: we are required to follow the 
variation in the solution on the shortest length scale to maintain stability of the 
integration, even though accuracy requirements allow a much larger stepsize. 

To see how we might cure this problem, consider the single equation 

$$y' = -cy \tag{16.6.5}$$

where c > 0 is a constant. The explicit (or forward) Euler scheme for integrating 
this equation with stepsize h is 


$$y_{n+1} = y_n + h y'_n = (1 - ch) y_n \tag{16.6.6}$$

The method is called explicit because the new value $y_{n+1}$ is given explicitly in
terms of the old value $y_n$. Clearly the method is unstable if $h > 2/c$, for then
$|y_n| \to \infty$ as $n \to \infty$.

The simplest cure is to resort to implicit differencing, where the right-hand side
is evaluated at the new $y$ location. In this case, we get the backward Euler scheme:

$$y_{n+1} = y_n + h y'_{n+1} \tag{16.6.7}$$

or

$$y_{n+1} = \frac{y_n}{1 + ch} \tag{16.6.8}$$

The method is absolutely stable: even as $h \to \infty$, $y_{n+1} \to 0$, which is in fact the
correct solution of the differential equation. If we think of $x$ as representing time,
then the implicit method converges to the true equilibrium solution (i.e., the solution
at late times) for large stepsizes. This nice feature of implicit methods holds only
for linear systems, but even in the general case implicit methods give better stability.
Of course, we give up accuracy in following the evolution towards equilibrium if
we use large stepsizes, but we maintain stability.
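
The contrast between (16.6.6) and (16.6.8) is easy to try numerically. The following sketch simply iterates the two update formulas for $y' = -cy$; the function names are ours and are used only for this illustration.

```c
/* Advance y' = -c*y by n steps of size h with the explicit Euler
   update of eq. (16.6.6). */
double euler_explicit(double y, double c, double h, int n)
{
    for (int i = 0; i < n; i++)
        y = (1.0 - c * h) * y;
    return y;
}

/* Same problem with the backward (implicit) Euler update of eq. (16.6.8). */
double euler_implicit(double y, double c, double h, int n)
{
    for (int i = 0; i < n; i++)
        y = y / (1.0 + c * h);
    return y;
}
```

With $c = 1000$ and $h = 0.003$ (so that $h > 2/c$), the explicit iterate is multiplied by about $-2$ on every step and grows without bound, while the implicit iterate is divided by 4 on every step and decays, as the true solution does.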

These considerations can easily be generalized to sets of linear equations with
constant coefficients:

$$\mathbf{y}' = -\mathbf{C} \cdot \mathbf{y} \tag{16.6.9}$$

where $\mathbf{C}$ is a positive definite matrix. Explicit differencing gives

$$\mathbf{y}_{n+1} = (\mathbf{1} - \mathbf{C}h) \cdot \mathbf{y}_n \tag{16.6.10}$$

Now a matrix $\mathbf{A}^n$ tends to zero as $n \to \infty$ only if the largest eigenvalue of $\mathbf{A}$
has magnitude less than unity. Thus $\mathbf{y}_n$ is bounded as $n \to \infty$ only if the largest
eigenvalue of $\mathbf{1} - \mathbf{C}h$ is less than 1 in magnitude, or in other words

$$h < \frac{2}{\lambda_{\max}} \tag{16.6.11}$$

where $\lambda_{\max}$ is the largest eigenvalue of $\mathbf{C}$.

On the other hand, implicit differencing gives

$$\mathbf{y}_{n+1} = \mathbf{y}_n + h \mathbf{y}'_{n+1} \tag{16.6.12}$$

or

$$\mathbf{y}_{n+1} = (\mathbf{1} + \mathbf{C}h)^{-1} \cdot \mathbf{y}_n \tag{16.6.13}$$

If the eigenvalues of $\mathbf{C}$ are $\lambda$, then the eigenvalues of $(\mathbf{1} + \mathbf{C}h)^{-1}$ are $(1 + \lambda h)^{-1}$,
which has magnitude less than one for all $h > 0$. (Recall that all the eigenvalues of a
positive definite matrix are positive.) Thus the method is stable for all stepsizes
$h$. The penalty we pay for this stability is that we are required to invert a matrix
at each step.
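
As a concrete case, implicit differencing can be applied to the example system (16.6.1), whose coefficient matrix has decay rates 1 and 1000. The sketch below hard-codes that 2-by-2 matrix and solves the linear system of each step by Cramer's rule; the function name is ours, and a general code would of course use an LU decomposition instead.

```c
/* One backward Euler step, as in eq. (16.6.13), for the linear system (16.6.1):
       u' =  998 u + 1998 v
       v' = -999 u - 1999 v
   Writing the system as y' = A.y, we solve (1 - h A).(u_new, v_new) = (u, v)
   with Cramer's rule. */
void backward_euler_step(double h, double *u, double *v)
{
    const double a11 = 998.0, a12 = 1998.0, a21 = -999.0, a22 = -1999.0;
    /* M = 1 - h A */
    double m11 = 1.0 - h * a11, m12 = -h * a12;
    double m21 = -h * a21,      m22 = 1.0 - h * a22;
    double det = m11 * m22 - m12 * m21;   /* equals (1 + h)(1 + 1000 h) */
    double un = (m22 * *u - m12 * *v) / det;
    double vn = (m11 * *v - m21 * *u) / det;
    *u = un;
    *v = vn;
}
```

With $h = 0.01$, ten times larger than the explicit stability limit suggested by the fast component, one hundred steps from $u = 1$, $v = 0$ land close to the exact values $2e^{-1}$ and $-e^{-1}$ of (16.6.4), with the first-order error one expects from Euler differencing.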

Not all equations are linear with constant coefficients, unfortunately! For
the system

$$\mathbf{y}' = \mathbf{f}(\mathbf{y}) \tag{16.6.14}$$

implicit differencing gives

$$\mathbf{y}_{n+1} = \mathbf{y}_n + h \mathbf{f}(\mathbf{y}_{n+1}) \tag{16.6.15}$$

In general this is some nasty set of nonlinear equations that has to be solved iteratively
at each step. Suppose we try linearizing the equations, as in Newton’s method:

$$\mathbf{y}_{n+1} = \mathbf{y}_n + h \left[ \mathbf{f}(\mathbf{y}_n) + \left. \frac{\partial \mathbf{f}}{\partial \mathbf{y}} \right|_{\mathbf{y}_n} \cdot (\mathbf{y}_{n+1} - \mathbf{y}_n) \right] \tag{16.6.16}$$

Here $\partial \mathbf{f}/\partial \mathbf{y}$ is the matrix of the partial derivatives of the right-hand side (the Jacobian
matrix). Rearrange equation (16.6.16) into the form

$$\mathbf{y}_{n+1} = \mathbf{y}_n + h \left[ \mathbf{1} - h \frac{\partial \mathbf{f}}{\partial \mathbf{y}} \right]^{-1} \cdot \mathbf{f}(\mathbf{y}_n) \tag{16.6.17}$$

If $h$ is not too big, only one iteration of Newton’s method may be accurate enough
to solve equation (16.6.15) using equation (16.6.17). In other words, at each step
we have to invert the matrix

$$\mathbf{1} - h \frac{\partial \mathbf{f}}{\partial \mathbf{y}} \tag{16.6.18}$$

to find $\mathbf{y}_{n+1}$. Solving implicit methods by linearization is called a “semi-implicit”
method, so equation (16.6.17) is the semi-implicit Euler method. It is not guaranteed
to be stable, but it usually is, because the behavior is locally similar to the case of
a constant matrix $\mathbf{C}$ described above.
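
For a single equation the matrix inverse in (16.6.17) reduces to a division, which makes the semi-implicit Euler method easy to sketch. The function names and the nonlinear test problem $y' = -50y^2$ (exact solution $y = 1/(1 + 50x)$ for $y(0) = 1$) are our own choices for illustration.

```c
/* One semi-implicit Euler step, eq. (16.6.17), for a scalar equation y' = f(y):
       y_new = y + h * f(y) / (1 - h * f'(y)).
   f and dfdy are caller-supplied functions. */
double semi_implicit_euler_step(double (*f)(double), double (*dfdy)(double),
                                double y, double h)
{
    return y + h * f(y) / (1.0 - h * dfdy(y));
}

/* Example right-hand side y' = -50 y^2 and its derivative with respect to y. */
double rhs_example(double y)  { return -50.0 * y * y; }
double drhs_example(double y) { return -100.0 * y; }
```

For this problem an explicit Euler step of size $h = 0.1$ from $y = 1$ would overshoot to $y = -4$, whereas the semi-implicit step gives $y = 6/11$ and subsequent steps keep decreasing monotonically toward zero. The large-step values are not accurate, but they remain stable.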

So far we have dealt only with implicit methods that are first-order accurate. 
While these are very robust, most problems will benefit from higher-order methods. 
There are three important classes of higher-order methods for stiff systems: 

• Generalizations of the Runge-Kutta method, of which the most useful 
are the Rosenbrock methods. The first practical implementation of these 
ideas was by Kaps and Rentrop, and so these methods are also called 
Kaps-Rentrop methods. 

• Generalizations of the Bulirsch-Stoer method, in particular a semi-implicit 
extrapolation method due to Bader and Deuflhard. 

• Predictor-corrector methods, most of which are descendants of Gear’s 
backward differentiation method. 

We shall give implementations of the first two methods. Note that systems where
the right-hand side depends explicitly on $x$, $\mathbf{f}(\mathbf{y}, x)$, can be handled by adding $x$ to
the list of dependent variables so that the system to be solved is

$$\begin{pmatrix} \mathbf{y} \\ x \end{pmatrix}' = \begin{pmatrix} \mathbf{f} \\ 1 \end{pmatrix} \tag{16.6.19}$$

In both the routines to be given in this section, we have explicitly carried out this
replacement for you, so the routines can handle right-hand sides of the form $\mathbf{f}(\mathbf{y}, x)$
without any special effort on your part.

We now mention an important point: It is absolutely crucial to scale your variables
properly when integrating stiff problems with automatic stepsize adjustment.
As in our nonstiff routines, you will be asked to supply a vector $\mathbf{y}_{\mathrm{scal}}$ with which
the error is to be scaled. For example, to get constant fractional errors, simply set
$\mathbf{y}_{\mathrm{scal}} = |\mathbf{y}|$. You can get constant absolute errors relative to some maximum values
by setting $\mathbf{y}_{\mathrm{scal}}$ equal to those maximum values. In stiff problems, there are often
strongly decreasing pieces of the solution which you are not particularly interested
in following once they are small. You can control the relative error above some
threshold $C$ and the absolute error below the threshold by setting

$$\mathbf{y}_{\mathrm{scal}} = \max(C, |\mathbf{y}|) \tag{16.6.20}$$

If you are using appropriate nondimensional units, then each component of $C$ should
be of order unity. If you are not sure what values to take for $C$, simply try
setting each component equal to unity. We strongly advocate the choice (16.6.20)
for stiff problems.

One final warning: Solving stiff problems can sometimes lead to catastrophic
precision loss. Be alert for situations where double precision is necessary.

Rosenbrock Methods


These methods have the advantage of being relatively simple to understand and implement.
For moderate accuracies ($\epsilon \lesssim 10^{-4}$ to $10^{-5}$ in the error criterion) and moderate-sized
systems ($N \lesssim 10$), they are competitive with the more complicated algorithms. For more
stringent parameters, Rosenbrock methods remain reliable; they merely become less efficient
than competitors like the semi-implicit extrapolation method (see below).

A Rosenbrock method seeks a solution of the form

$$\mathbf{y}(x_0 + h) = \mathbf{y}_0 + \sum_{i=1}^{s} c_i \mathbf{k}_i \tag{16.6.21}$$

where the corrections $\mathbf{k}_i$ are found by solving $s$ linear equations that generalize the structure
in (16.6.17):

$$(\mathbf{1} - \gamma h \mathbf{f}') \cdot \mathbf{k}_i = h \mathbf{f}\left( \mathbf{y}_0 + \sum_{j=1}^{i-1} \alpha_{ij} \mathbf{k}_j \right) + h \mathbf{f}' \cdot \sum_{j=1}^{i-1} \gamma_{ij} \mathbf{k}_j, \qquad i = 1, \ldots, s \tag{16.6.22}$$

Here we denote the Jacobian matrix by $\mathbf{f}'$. The coefficients $\gamma$, $c_i$, $\alpha_{ij}$, and $\gamma_{ij}$ are fixed
constants independent of the problem. If $\gamma = \gamma_{ij} = 0$, this is simply a Runge-Kutta scheme.
Equations (16.6.22) can be solved successively for $\mathbf{k}_1, \mathbf{k}_2, \ldots$.

Crucial to the success of a stiff integration scheme is an automatic stepsize adjustment
algorithm. Kaps and Rentrop [2] discovered an embedded or Runge-Kutta-Fehlberg method
as described in §16.2: Two estimates of the form (16.6.21) are computed, the “real” one $\mathbf{y}$ and
a lower-order estimate $\hat{\mathbf{y}}$ with different coefficients $\hat{c}_i$, $i = 1, \ldots, \hat{s}$, where $\hat{s} < s$ but the $\mathbf{k}_i$
are the same. The difference between $\mathbf{y}$ and $\hat{\mathbf{y}}$ leads to an estimate of the local truncation error,
which can then be used for stepsize control. Kaps and Rentrop showed that the smallest value
of $s$ for which embedding is possible is $s = 4$, $\hat{s} = 3$, leading to a fourth-order method.

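To minimize the matrix-vector multiplications on the right-hand side of (16.6.22), we
rewrite the equations in terms of quantities

$$\mathbf{g}_i = \sum_{j=1}^{i-1} \gamma_{ij} \mathbf{k}_j + \gamma \mathbf{k}_i \tag{16.6.23}$$

The equations then take the form

$$\begin{aligned}
(\mathbf{1}/\gamma h - \mathbf{f}') \cdot \mathbf{g}_1 &= \mathbf{f}(\mathbf{y}_0) \\
(\mathbf{1}/\gamma h - \mathbf{f}') \cdot \mathbf{g}_2 &= \mathbf{f}(\mathbf{y}_0 + a_{21}\mathbf{g}_1) + c_{21}\mathbf{g}_1/h \\
(\mathbf{1}/\gamma h - \mathbf{f}') \cdot \mathbf{g}_3 &= \mathbf{f}(\mathbf{y}_0 + a_{31}\mathbf{g}_1 + a_{32}\mathbf{g}_2) + (c_{31}\mathbf{g}_1 + c_{32}\mathbf{g}_2)/h \\
(\mathbf{1}/\gamma h - \mathbf{f}') \cdot \mathbf{g}_4 &= \mathbf{f}(\mathbf{y}_0 + a_{41}\mathbf{g}_1 + a_{42}\mathbf{g}_2 + a_{43}\mathbf{g}_3) + (c_{41}\mathbf{g}_1 + c_{42}\mathbf{g}_2 + c_{43}\mathbf{g}_3)/h
\end{aligned} \tag{16.6.24}$$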

In our implementation stiff of the Kaps-Rentrop algorithm, we have carried out the
replacement (16.6.19) explicitly in equations (16.6.24), so you need not concern yourself
about it. Simply provide a routine (called derivs in stiff) that returns $\mathbf{f}$ (called dydx) as a
function of $x$ and $\mathbf{y}$. Also supply a routine jacobn that returns $\mathbf{f}'$ (dfdy) and $\partial \mathbf{f}/\partial x$ (dfdx)
as functions of $x$ and $\mathbf{y}$. If $x$ does not occur explicitly on the right-hand side, then dfdx will
be zero. Usually the Jacobian matrix will be available to you by analytic differentiation of
the right-hand side $\mathbf{f}$. If not, your routine will have to compute it by numerical differencing
with appropriate increments $\Delta y$.

Kaps and Rentrop gave two different sets of parameters, which have slightly different 
stability properties. Several other sets have been proposed. Our default choice is that of 
Shampine [3], but we also give you one of the Kaps-Rentrop sets as an option. Some proposed 
parameter sets require function evaluations outside the domain of integration; we prefer to 
avoid that complication. 

The calling sequence of stiff is exactly the same as the nonstiff routines given earlier
in this chapter. It is thus “plug-compatible” with them in the general ODE integrating routine
odeint. This compatibility requires, unfortunately, one slight anomaly: While the user-supplied
routine derivs is a dummy argument (which can therefore have any actual name),
the other user-supplied routine is not an argument and must be named (exactly) jacobn.

stiff begins by saving the initial values, in case the step has to be repeated because
the error tolerance is exceeded. The linear equations (16.6.24) are solved by first computing
the LU decomposition of the matrix $\mathbf{1}/\gamma h - \mathbf{f}'$ using the routine ludcmp. Then the four
$\mathbf{g}_i$ are found by back-substitution of the four different right-hand sides using lubksb. Note
that each step of the integration requires one call to jacobn and three calls to derivs (one
call to get dydx before calling stiff, and two calls inside stiff). The reason only three
calls are needed and not four is that the parameters have been chosen so that the last two
calls in equation (16.6.24) are done with the same arguments. Counting the evaluation of
the Jacobian matrix as roughly equivalent to $N$ evaluations of the right-hand side $\mathbf{f}$, we see
that the Kaps-Rentrop scheme involves about $N + 3$ function evaluations per step. Note that
if $N$ is large and the Jacobian matrix is sparse, you should replace the LU decomposition
by a suitable sparse matrix procedure.

Stepsize control depends on the fact that

$$\begin{aligned} \mathbf{y}_{\mathrm{exact}} &= \mathbf{y} + O(h^5) \\ \mathbf{y}_{\mathrm{exact}} &= \hat{\mathbf{y}} + O(h^4) \end{aligned} \tag{16.6.25}$$

Thus

$$|\mathbf{y} - \hat{\mathbf{y}}| = O(h^4) \tag{16.6.26}$$

Referring back to the steps leading from equation (16.2.4) to equation (16.2.10), we see that
the new stepsize should be chosen as in equation (16.2.10) but with the exponents 1/4 and 1/5
replaced by 1/3 and 1/4, respectively. Also, experience shows that it is wise to prevent too large
a stepsize change in one step, otherwise we will probably have to undo the large change in the
next step. We adopt 0.5 and 1.5 as the maximum allowed decrease and increase of $h$ in one step.


#include <math.h>
#include "nrutil.h"
#define SAFETY 0.9
#define GROW 1.5
#define PGROW -0.25
#define SHRNK 0.5
#define PSHRNK (-1.0/3.0)
#define ERRCON 0.1296
#define MAXTRY 40
/* Here NMAX is the maximum value of n; GROW and SHRNK are the largest and smallest factors
by which stepsize can change in one step; ERRCON equals (GROW/SAFETY) raised to the power
(1/PGROW) and handles the case when errmax is approximately 0. */
#define GAM (1.0/2.0)
#define A21 2.0
#define A31 (48.0/25.0)
#define A32 (6.0/25.0)
#define C21 -8.0
#define C31 (372.0/25.0)
#define C32 (12.0/5.0)
#define C41 (-112.0/125.0)
#define C42 (-54.0/125.0)
#define C43 (-2.0/5.0)
#define B1 (19.0/9.0)
#define B2 (1.0/2.0)
#define B3 (25.0/108.0)
#define B4 (125.0/108.0)
#define E1 (17.0/54.0)
#define E2 (7.0/36.0)
#define E3 0.0
#define E4 (125.0/108.0)
#define C1X (1.0/2.0)
#define C2X (-3.0/2.0)
#define C3X (121.0/50.0)
#define C4X (29.0/250.0)
#define A2X 1.0
#define A3X (3.0/5.0)

void stiff(float y[], float dydx[], int n, float *x, float htry, float eps,
    float yscal[], float *hdid, float *hnext,
    void (*derivs)(float, float [], float []))
/* Fourth-order Rosenbrock step for integrating stiff o.d.e.'s, with monitoring of local truncation
error to adjust stepsize. Input are the dependent variable vector y[1..n] and its derivative
dydx[1..n] at the starting value of the independent variable x. Also input are the stepsize to
be attempted htry, the required accuracy eps, and the vector yscal[1..n] against which
the error is scaled. On output, y and x are replaced by their new values, hdid is the stepsize
that was actually accomplished, and hnext is the estimated next stepsize. derivs is a user-supplied
routine that computes the right-hand side f (returned as dydx), while jacobn (a fixed
name) is a user-supplied routine that computes the derivative of the right-hand side with
respect to x (dfdx) and the Jacobi matrix of derivatives of the right-hand side with respect
to the components of y (dfdy). */

{
    void jacobn(float x, float y[], float dfdx[], float **dfdy, int n);
    void lubksb(float **a, int n, int *indx, float b[]);
    void ludcmp(float **a, int n, int *indx, float *d);
    int i,j,jtry,*indx;
    float d,errmax,h,xsav,**a,*dfdx,**dfdy,*dysav,*err;
    float *g1,*g2,*g3,*g4,*ysav;

    indx=ivector(1,n);
    a=matrix(1,n,1,n);
    dfdx=vector(1,n);
    dfdy=matrix(1,n,1,n);
    dysav=vector(1,n);
    err=vector(1,n);
    g1=vector(1,n);
    g2=vector(1,n);
    g3=vector(1,n);
    g4=vector(1,n);
    ysav=vector(1,n);
    xsav=(*x);                              /* Save initial values. */
    for (i=1;i<=n;i++) {
        ysav[i]=y[i];
        dysav[i]=dydx[i];
    }
    jacobn(xsav,ysav,dfdx,dfdy,n);
    /* The user must supply this routine to return the n-by-n matrix dfdy and the vector dfdx. */
    h=htry;                                 /* Set stepsize to the initial trial value. */
    for (jtry=1;jtry<=MAXTRY;jtry++) {
        for (i=1;i<=n;i++) {                /* Set up the matrix 1 - gamma*h*f'. */
            for (j=1;j<=n;j++) a[i][j] = -dfdy[i][j];
            a[i][i] += 1.0/(GAM*h);
        }
        ludcmp(a,n,indx,&d);                /* LU decomposition of the matrix. */
        for (i=1;i<=n;i++)                  /* Set up right-hand side for g1. */
            g1[i]=dysav[i]+h*C1X*dfdx[i];
        lubksb(a,n,indx,g1);                /* Solve for g1. */
        for (i=1;i<=n;i++)                  /* Compute intermediate values of y and x. */
            y[i]=ysav[i]+A21*g1[i];
        *x=xsav+A2X*h;
        (*derivs)(*x,y,dydx);               /* Compute dydx at the intermediate values. */
        for (i=1;i<=n;i++)                  /* Set up right-hand side for g2. */
            g2[i]=dydx[i]+h*C2X*dfdx[i]+C21*g1[i]/h;
        lubksb(a,n,indx,g2);                /* Solve for g2. */
        for (i=1;i<=n;i++)                  /* Compute intermediate values of y and x. */
            y[i]=ysav[i]+A31*g1[i]+A32*g2[i];
        *x=xsav+A3X*h;
        (*derivs)(*x,y,dydx);               /* Compute dydx at the intermediate values. */
        for (i=1;i<=n;i++)                  /* Set up right-hand side for g3. */
            g3[i]=dydx[i]+h*C3X*dfdx[i]+(C31*g1[i]+C32*g2[i])/h;
        lubksb(a,n,indx,g3);                /* Solve for g3. */
        for (i=1;i<=n;i++)                  /* Set up right-hand side for g4. */
            g4[i]=dydx[i]+h*C4X*dfdx[i]+(C41*g1[i]+C42*g2[i]+C43*g3[i])/h;
        lubksb(a,n,indx,g4);                /* Solve for g4. */
        for (i=1;i<=n;i++) {                /* Get fourth-order estimate of y and error estimate. */
            y[i]=ysav[i]+B1*g1[i]+B2*g2[i]+B3*g3[i]+B4*g4[i];
            err[i]=E1*g1[i]+E2*g2[i]+E3*g3[i]+E4*g4[i];
        }
        *x=xsav+h;
        if (*x == xsav) nrerror("stepsize not significant in stiff");
        errmax=0.0;                         /* Evaluate accuracy. */
        for (i=1;i<=n;i++) errmax=FMAX(errmax,fabs(err[i]/yscal[i]));
        errmax /= eps;                      /* Scale relative to required tolerance. */
        if (errmax <= 1.0) {                /* Step succeeded. Compute size of next step and return. */
            *hdid=h;
            *hnext=(errmax > ERRCON ? SAFETY*h*pow(errmax,PGROW) : GROW*h);
            free_vector(ysav,1,n);
            free_vector(g4,1,n);
            free_vector(g3,1,n);
            free_vector(g2,1,n);
            free_vector(g1,1,n);
            free_vector(err,1,n);
            free_vector(dysav,1,n);
            free_matrix(dfdy,1,n,1,n);
            free_vector(dfdx,1,n);
            free_matrix(a,1,n,1,n);
            free_ivector(indx,1,n);
            return;
        } else {                            /* Truncation error too large, reduce stepsize. */
            *hnext=SAFETY*h*pow(errmax,PSHRNK);
            h=(h >= 0.0 ? FMAX(*hnext,SHRNK*h) : FMIN(*hnext,SHRNK*h));
        }
    }                                       /* Go back and re-try step. */
    nrerror("exceeded MAXTRY in stiff");
}


Here are the Kaps-Rentrop parameters, which can be substituted for those of Shampine 
simply by replacing the #define statements: 


#define GAM 0.231
#define A21 2.0
#define A31 4.52470820736
#define A32 4.16352878860
#define C21 -5.07167533877
#define C31 6.02015272865
#define C32 0.159750684673
#define C41 -1.856343618677
#define C42 -8.50538085819
#define C43 -2.08407513602
#define B1 3.95750374663
#define B2 4.62489238836
#define B3 0.617477263873
#define B4 1.282612945268
#define E1 -2.30215540292
#define E2 -3.07363448539
#define E3 0.873280801802
#define E4 1.282612945268
#define C1X GAM
#define C2X -0.396296677520e-01
#define C3X 0.550778939579
#define C4X -0.553509845700e-01
#define A2X 0.462
#define A3X 0.880208333333


As an example of how stiff is used, one can solve the system

$$
\begin{aligned}
y_1' &= -0.013\,y_1 - 1000\,y_1 y_3 \\
y_2' &= -2500\,y_2 y_3 \\
y_3' &= -0.013\,y_1 - 1000\,y_1 y_3 - 2500\,y_2 y_3
\end{aligned}
\tag{16.6.27}
$$

with initial conditions

$$ y_1(0) = 1, \qquad y_2(0) = 1, \qquad y_3(0) = 0 \tag{16.6.28} $$

(This is test problem D4 in [4].) We integrate the system up to $x = 50$ with an initial stepsize
of $h = 2.9 \times 10^{-4}$ using odeint. The components of C in (16.6.20) are all set to unity.
The routines derivs and jacobn for this problem are given below. Even though the ratio
of largest to smallest decay constants for this problem is around $10^6$, stiff succeeds in
integrating this set in only 29 steps with $\epsilon = 10^{-4}$. By contrast, the Runge-Kutta routine
rkqs requires 51,012 steps!


void jacobn(float x, float y[], float dfdx[], float **dfdy, int n)
{
    int i;

    for (i=1;i<=n;i++) dfdx[i]=0.0;     /* Right-hand side has no explicit x dependence. */
    dfdy[1][1] = -0.013-1000.0*y[3];
    dfdy[1][2]=0.0;
    dfdy[1][3] = -1000.0*y[1];
    dfdy[2][1]=0.0;
    dfdy[2][2] = -2500.0*y[3];
    dfdy[2][3] = -2500.0*y[2];
    dfdy[3][1] = -0.013-1000.0*y[3];
    dfdy[3][2] = -2500.0*y[3];
    dfdy[3][3] = -1000.0*y[1]-2500.0*y[2];
}

void derivs(float x, float y[], float dydx[])
{
    dydx[1] = -0.013*y[1]-1000.0*y[1]*y[3];
    dydx[2] = -2500.0*y[2]*y[3];
    dydx[3] = -0.013*y[1]-1000.0*y[1]*y[3]-2500.0*y[2]*y[3];
}
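A hand-coded Jacobian is a common source of errors, so it can be worth comparing it against finite differences before handing it to a stiff integrator. The following is a minimal sketch of such a check for the system (16.6.27). It is not part of the book's routines: the function names rhs, jac, and jacobian_mismatch are hypothetical, and it uses 0-based double arrays rather than the unit-offset float arrays of the listings.

```c
#include <math.h>

/* Right-hand side of (16.6.27), written with 0-based arrays. */
void rhs(const double y[3], double dydx[3])
{
    dydx[0] = -0.013*y[0] - 1000.0*y[0]*y[2];
    dydx[1] = -2500.0*y[1]*y[2];
    dydx[2] = -0.013*y[0] - 1000.0*y[0]*y[2] - 2500.0*y[1]*y[2];
}

/* Analytic Jacobian, same entries as in jacobn. */
void jac(const double y[3], double J[3][3])
{
    J[0][0] = -0.013 - 1000.0*y[2]; J[0][1] = 0.0;          J[0][2] = -1000.0*y[0];
    J[1][0] = 0.0;                  J[1][1] = -2500.0*y[2]; J[1][2] = -2500.0*y[1];
    J[2][0] = -0.013 - 1000.0*y[2]; J[2][1] = -2500.0*y[2];
    J[2][2] = -1000.0*y[0] - 2500.0*y[1];
}

/* Largest absolute difference between the analytic Jacobian and a
   central-difference approximation with the given step. */
double jacobian_mismatch(const double y[3], double step)
{
    double J[3][3], yp[3], ym[3], fp[3], fm[3], worst = 0.0;
    int i, j, k;

    jac(y, J);
    for (j = 0; j < 3; j++) {
        for (k = 0; k < 3; k++) yp[k] = ym[k] = y[k];
        yp[j] += step;
        ym[j] -= step;
        rhs(yp, fp);
        rhs(ym, fm);
        for (i = 0; i < 3; i++) {
            double diff = fabs((fp[i] - fm[i])/(2.0*step) - J[i][j]);
            if (diff > worst) worst = diff;
        }
    }
    return worst;
}
```

Because this right-hand side is at most quadratic in the components of y, central differences are exact apart from roundoff, so any mismatch much larger than roundoff would point to a typing error in the Jacobian.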



Semi-implicit Extrapolation Method

The Bulirsch-Stoer method, which discretizes the differential equation using the modified
midpoint rule, does not work for stiff problems. Bader and Deuflhard [5] discovered a semi-implicit
discretization that works very well and that lends itself to extrapolation exactly as
in the original Bulirsch-Stoer method.

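The starting point is an implicit form of the midpoint rule:

$$ \mathbf{y}_{n+1} - \mathbf{y}_{n-1} = 2h\,\mathbf{f}\!\left(\frac{\mathbf{y}_{n+1} + \mathbf{y}_{n-1}}{2}\right) \tag{16.6.29} $$

Convert this equation into semi-implicit form by linearizing the right-hand side about $\mathbf{f}(\mathbf{y}_n)$.
The result is the semi-implicit midpoint rule:

$$ \left[\mathbf{1} - h\frac{\partial \mathbf{f}}{\partial \mathbf{y}}\right] \cdot \mathbf{y}_{n+1} = \left[\mathbf{1} + h\frac{\partial \mathbf{f}}{\partial \mathbf{y}}\right] \cdot \mathbf{y}_{n-1} + 2h\left[\mathbf{f}(\mathbf{y}_n) - \frac{\partial \mathbf{f}}{\partial \mathbf{y}} \cdot \mathbf{y}_n\right] \tag{16.6.30} $$

It is used with a special first step, the semi-implicit Euler step (16.6.17), and a special
"smoothing" last step in which the last $\mathbf{y}_n$ is replaced by

$$ \bar{\mathbf{y}}_n \equiv \tfrac{1}{2}(\mathbf{y}_{n+1} + \mathbf{y}_{n-1}) \tag{16.6.31} $$

Bader and Deuflhard showed that the error series for this method once again involves only
even powers of $h$.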

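For practical implementation, it is better to rewrite the equations using $\Delta_k \equiv \mathbf{y}_{k+1} - \mathbf{y}_k$.
With $h = H/m$, start by calculating

$$
\begin{aligned}
\Delta_0 &= \left[\mathbf{1} - h\frac{\partial \mathbf{f}}{\partial \mathbf{y}}\right]^{-1} \cdot h\,\mathbf{f}(\mathbf{y}_0) \\
\mathbf{y}_1 &= \mathbf{y}_0 + \Delta_0
\end{aligned}
\tag{16.6.32}
$$

Then for $k = 1, \ldots, m-1$, set

$$
\begin{aligned}
\Delta_k &= \Delta_{k-1} + 2\left[\mathbf{1} - h\frac{\partial \mathbf{f}}{\partial \mathbf{y}}\right]^{-1} \cdot \left[h\,\mathbf{f}(\mathbf{y}_k) - \Delta_{k-1}\right] \\
\mathbf{y}_{k+1} &= \mathbf{y}_k + \Delta_k
\end{aligned}
\tag{16.6.33}
$$

Finally compute

$$
\begin{aligned}
\Delta_m &= \left[\mathbf{1} - h\frac{\partial \mathbf{f}}{\partial \mathbf{y}}\right]^{-1} \cdot \left[h\,\mathbf{f}(\mathbf{y}_m) - \Delta_{m-1}\right] \\
\bar{\mathbf{y}}_m &= \mathbf{y}_m + \Delta_m
\end{aligned}
\tag{16.6.34}
$$

It is easy to incorporate the replacement (16.6.19) in the above formulas. The additional
terms in the Jacobian that come from $\partial \mathbf{f}/\partial x$ all cancel out of the semi-implicit midpoint rule
(16.6.30). In the special first step (16.6.17), and in the corresponding equation (16.6.32), the
term $h\mathbf{f}$ becomes $h\mathbf{f} + h^2\,\partial \mathbf{f}/\partial x$. The remaining equations are all unchanged.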

This algorithm is implemented in the routine simpr: 


#include "nrutil.h"

void simpr(float y[], float dydx[], float dfdx[], float **dfdy, int n,
    float xs, float htot, int nstep, float yout[],
    void (*derivs)(float, float [], float []))
/* Performs one step of semi-implicit midpoint rule. Input are the dependent variable y[1..n], its
derivative dydx[1..n], the derivative of the right-hand side with respect to x, dfdx[1..n],
and the Jacobian dfdy[1..n][1..n] at xs. Also input are htot, the total step to be taken,
and nstep, the number of substeps to be used. The output is returned as yout[1..n].
derivs is the user-supplied routine that calculates dydx. */
{
    void lubksb(float **a, int n, int *indx, float b[]);
    void ludcmp(float **a, int n, int *indx, float *d);
    int i,j,nn,*indx;
    float d,h,x,**a,*del,*ytemp;

    indx=ivector(1,n);
    a=matrix(1,n,1,n);
    del=vector(1,n);
    ytemp=vector(1,n);
    h=htot/nstep;                           /* Stepsize this trip. */
    for (i=1;i<=n;i++) {                    /* Set up the matrix 1 - h*f'. */
        for (j=1;j<=n;j++) a[i][j] = -h*dfdy[i][j];
        ++a[i][i];
    }
    ludcmp(a,n,indx,&d);                    /* LU decomposition of the matrix. */
    for (i=1;i<=n;i++)                      /* Set up right-hand side for first step. */
        yout[i]=h*(dydx[i]+h*dfdx[i]);      /* Use yout for temporary storage. */
    lubksb(a,n,indx,yout);
    for (i=1;i<=n;i++)                      /* First step. */
        ytemp[i]=y[i]+(del[i]=yout[i]);
    x=xs+h;
    (*derivs)(x,ytemp,yout);                /* Use yout for temporary storage of derivatives. */
    for (nn=2;nn<=nstep;nn++) {             /* General step. */
        for (i=1;i<=n;i++)                  /* Set up right-hand side for general step. */
            yout[i]=h*yout[i]-del[i];
        lubksb(a,n,indx,yout);
        for (i=1;i<=n;i++)
            ytemp[i] += (del[i] += 2.0*yout[i]);
        x += h;
        (*derivs)(x,ytemp,yout);
    }
    for (i=1;i<=n;i++)                      /* Set up right-hand side for last step. */
        yout[i]=h*yout[i]-del[i];
    lubksb(a,n,indx,yout);
    for (i=1;i<=n;i++)                      /* Take last step. */
        yout[i] += ytemp[i];
    free_vector(ytemp,1,n);
    free_vector(del,1,n);
    free_matrix(a,1,n,1,n);
    free_ivector(indx,1,n);
}
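For a single linear equation the recursion (16.6.32) to (16.6.34) collapses to scalar arithmetic, which makes its behavior on a stiff problem easy to inspect. The sketch below is an illustration only, under the assumption that the test equation is y' = -lambda*y (so the Jacobian is the constant -lambda and there is no explicit x dependence); the function name simpr_scalar is hypothetical.

```c
/* Semi-implicit midpoint rule, equations (16.6.32)-(16.6.34), for the scalar
   test equation y' = -lambda*y. Advances y0 over a total step htot using
   nstep substeps and returns the smoothed end value. */
double simpr_scalar(double y0, double lambda, double htot, int nstep)
{
    double h = htot/nstep;
    double inv = 1.0/(1.0 + h*lambda);      /* scalar form of [1 - h df/dy]^(-1) */
    double del = inv*h*(-lambda*y0);        /* Delta_0, equation (16.6.32) */
    double y = y0 + del;                    /* y_1 */
    int k;

    for (k = 1; k < nstep; k++) {           /* general step, equation (16.6.33) */
        del += 2.0*inv*(h*(-lambda*y) - del);
        y += del;
    }
    del = inv*(h*(-lambda*y) - del);        /* Delta_m, equation (16.6.34) */
    return y + del;                         /* smoothed value */
}
```

For a mild problem (lambda = 1) the result tracks exp(-htot) to second order in h, while for a very stiff one (say lambda = 10^4 with htot = 1 and two substeps) the smoothing step leaves a strongly damped value instead of the growing oscillation an explicit midpoint rule would produce.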


The routine simpr is intended to be used in a routine stifbs that is almost exactly the 
same as bsstep. The only differences are: 

• The stepsize sequence is

$$ n = 2, 6, 10, 14, 22, 34, 50, \ldots \tag{16.6.35} $$

where each member differs from its predecessor by the smallest multiple of 4 that
makes the ratio of successive terms be $\le \tfrac{5}{7}$. The parameter KMAXX is taken to be 7.

• The work per unit step now includes the cost of Jacobian evaluations as well
as function evaluations. We count one Jacobian evaluation as equivalent to N
function evaluations, where N is the number of equations.

• Once again the user-supplied routine derivs is a dummy argument and so can have
any name. However, to maintain "plug-compatibility" with rkqs, bsstep and
stiff, the routine jacobn is not an argument and must have exactly this name. It
is called once per step to return $\mathbf{f}'$ (dfdy) and $\partial \mathbf{f}/\partial x$ (dfdx) as functions of x and y.
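The rule for the stepsize sequence in the first item can be checked with a few lines of integer arithmetic. This is a sketch, not code from the book: the function name stifbs_sequence is hypothetical, the array is 0-based, and the ratio bound is taken to be 5/7, which is the value that reproduces the tabulated sequence 2, 6, 10, 14, 22, 34, 50, 70 used in the routine's nseq array.

```c
/* Fill nseq[0..count-1]: start at 2; each later term is the previous term plus
   the smallest multiple of 4 for which previous/next <= 5/7. The comparison is
   done in integers (7*prev <= 5*next) to avoid any rounding at equality. */
void stifbs_sequence(int nseq[], int count)
{
    int j, next;

    if (count <= 0) return;
    nseq[0] = 2;
    for (j = 1; j < count; j++) {
        next = nseq[j-1] + 4;
        while (7*nseq[j-1] > 5*next) next += 4;
        nseq[j] = next;
    }
}
```

Note that the ratio is exactly 5/7 for the pairs (10, 14) and (50, 70), so the inequality has to be non-strict for the rule to give the sequence in the listing.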

Here is the routine, with comments pointing out only the differences from bsstep: 

#include <math.h>
#include "nrutil.h"
#define KMAXX 7
#define IMAXX (KMAXX+1)
#define SAFE1 0.25
#define SAFE2 0.7
#define REDMAX 1.0e-5
#define REDMIN 0.7
#define TINY 1.0e-30
#define SCALMX 0.1

float **d,*x;

void stifbs(float y[], float dydx[], int nv, float *xx, float htry, float eps,
    float yscal[], float *hdid, float *hnext,
    void (*derivs)(float, float [], float []))
/* Semi-implicit extrapolation step for integrating stiff o.d.e.'s, with monitoring of local truncation
error to adjust stepsize. Input are the dependent variable vector y[1..nv] and its derivative
dydx[1..nv] at the starting value of the independent variable x. Also input are the stepsize
to be attempted htry, the required accuracy eps, and the vector yscal[1..nv] against
which the error is scaled. On output, y and x are replaced by their new values, hdid is the
stepsize that was actually accomplished, and hnext is the estimated next stepsize. derivs
is a user-supplied routine that computes the right-hand side derivatives dydx,
while jacobn (a fixed name) is a user-supplied routine that computes the Jacobi matrix of
derivatives of the right-hand side with respect to the components of y. Be sure to set htry
on successive steps to the value of hnext returned from the previous step, as is the case if the
routine is called by odeint. */

{
    void jacobn(float x, float y[], float dfdx[], float **dfdy, int n);
    void simpr(float y[], float dydx[], float dfdx[], float **dfdy,
        int n, float xs, float htot, int nstep, float yout[],
        void (*derivs)(float, float [], float []));
    void pzextr(int iest, float xest, float yest[], float yz[], float dy[],
        int nv);
    int i,iq,k,kk,km;
    static int first=1,kmax,kopt,nvold = -1;
    static float epsold = -1.0,xnew;
    float eps1,errmax,fact,h,red,scale,work,wrkmin,xest;
    float *dfdx,**dfdy,*err,*yerr,*ysav,*yseq;
    static float a[IMAXX+1];
    static float alf[KMAXX+1][KMAXX+1];
    static int nseq[IMAXX+1]={0,2,6,10,14,22,34,50,70};   /* Sequence is different from bsstep. */
    int reduct,exitflag=0;

    d=matrix(1,nv,1,KMAXX);
    dfdx=vector(1,nv);
    dfdy=matrix(1,nv,1,nv);
    err=vector(1,KMAXX);
    x=vector(1,KMAXX);
    yerr=vector(1,nv);
    ysav=vector(1,nv);
    yseq=vector(1,nv);
    if (eps != epsold || nv != nvold) {     /* Reinitialize also if nv has changed. */
        *hnext = xnew = -1.0e29;
        eps1=SAFE1*eps;
        a[1]=nseq[1]+1;
        for (k=1;k<=KMAXX;k++) a[k+1]=a[k]+nseq[k+1];
        for (iq=2;iq<=KMAXX;iq++) {
            for (k=1;k<iq;k++)
                alf[k][iq]=pow(eps1,((a[k+1]-a[iq+1])/
                    ((a[iq+1]-a[1]+1.0)*(2*k+1))));
        }
        epsold=eps;
        nvold=nv;                           /* Save nv. */
        a[1] += nv;                         /* Add cost of Jacobian evaluations to work coefficients. */
        for (k=1;k<=KMAXX;k++) a[k+1]=a[k]+nseq[k+1];
        for (kopt=2;kopt<KMAXX;kopt++)
            if (a[kopt+1] > a[kopt]*alf[kopt-1][kopt]) break;
        kmax=kopt;
    }
    h=htry;
    for (i=1;i<=nv;i++) ysav[i]=y[i];
    jacobn(*xx,y,dfdx,dfdy,nv);             /* Evaluate Jacobian. */
    if (*xx != xnew || h != (*hnext)) {
        first=1;
        kopt=kmax;
    }
    reduct=0;
    for (;;) {
        for (k=1;k<=kmax;k++) {







            xnew=(*xx)+h;
            if (xnew == (*xx)) nrerror("step size underflow in stifbs");
            simpr(ysav,dydx,dfdx,dfdy,nv,*xx,h,nseq[k],yseq,derivs);   /* Semi-implicit midpoint rule. */
            xest=SQR(h/nseq[k]);            /* The rest of the routine is identical to bsstep. */
            pzextr(k,xest,yseq,y,yerr,nv);
            if (k != 1) {
                errmax=TINY;
                for (i=1;i<=nv;i++) errmax=FMAX(errmax,fabs(yerr[i]/yscal[i]));
                errmax /= eps;
                km=k-1;
                err[km]=pow(errmax/SAFE1,1.0/(2*km+1));
            }
            if (k != 1 && (k >= kopt-1 || first)) {
                if (errmax < 1.0) {
                    exitflag=1;
                    break;
                }
                if (k == kmax || k == kopt+1) {
                    red=SAFE2/err[km];
                    break;
                }
                else if (k == kopt && alf[kopt-1][kopt] < err[km]) {
                    red=1.0/err[km];
                    break;
                }
                else if (kopt == kmax && alf[km][kmax-1] < err[km]) {
                    red=alf[km][kmax-1]*SAFE2/err[km];
                    break;
                }
                else if (alf[km][kopt] < err[km]) {
                    red=alf[km][kopt-1]/err[km];
                    break;
                }
            }
        }
        if (exitflag) break;
        red=FMIN(red,REDMIN);
        red=FMAX(red,REDMAX);
        h *= red;
        reduct=1;
    }
    *xx=xnew;
    *hdid=h;
    first=0;
    wrkmin=1.0e35;
    for (kk=1;kk<=km;kk++) {
        fact=FMAX(err[kk],SCALMX);
        work=fact*a[kk+1];
        if (work < wrkmin) {
            scale=fact;
            wrkmin=work;
            kopt=kk+1;
        }
    }
    *hnext=h/scale;
    if (kopt >= k && kopt != kmax && !reduct) {
        fact=FMAX(scale/alf[kopt-1][kopt],SCALMX);
        if (a[kopt+1]*fact <= wrkmin) {
            *hnext=h/fact;
            kopt++;
        }
    }
    free_vector(yseq,1,nv);
    free_vector(ysav,1,nv);
    free_vector(yerr,1,nv);
    free_vector(x,1,KMAXX);
    free_vector(err,1,KMAXX);
    free_matrix(dfdy,1,nv,1,nv);
    free_vector(dfdx,1,nv);
    free_matrix(d,1,nv,1,KMAXX);
}


The routine stifbs is an excellent routine for all stiff problems, competitive with
the best Gear-type routines. stiff is comparable in execution time for moderate N and
$\epsilon \lesssim 10^{-4}$. By the time $\epsilon \sim 10^{-8}$, stifbs is roughly an order of magnitude faster. There
are further improvements that could be applied to stifbs to make it even more robust. For
example, very occasionally ludcmp in simpr will encounter a singular matrix. You could
arrange for the stepsize to be reduced, say by a factor of the current nseq[k]. There are
also certain stability restrictions on the stepsize that come into play on some problems. For
a discussion of how to implement these automatically, see [6].

16.7 Multistep, Multivalue, and Predictor-Corrector Methods

The terms multistep and multivalue describe two different ways of implementing
essentially the same integration technique for ODEs. Predictor-corrector is a particular
subcategory of these methods, in fact the most widely used. Accordingly,
the name predictor-corrector is often loosely used to denote all these methods.

We suspect that predictor-corrector integrators have had their day, and that they 
are no longer the method of choice for most problems in ODEs. For high-precision 
applications, or applications where evaluations of the right-hand sides are expensive, 
Bulirsch-Stoer dominates. For convenience, or for low precision, adaptive-stepsize 
Runge-Kutta dominates. Predictor-corrector methods have been, we think, squeezed
out in the middle. There is possibly only one exceptional case: high-precision
solution of very smooth equations with very complicated right-hand sides, as we 
will describe later. 

Nevertheless, these methods have had a long historical run. Textbooks are 
full of information on them, and there are a lot of standard ODE programs around 
that are based on predictor-corrector methods. Many capable researchers have a 
lot of experience with predictor-corrector routines, and they see no reason to make 
a precipitous change of habit. It is not a bad idea for you to be familiar with the 
principles involved, and even with the sorts of bookkeeping details that are the bane 
of these methods. Otherwise there will be a big surprise in store when you first have 
to fix a problem in a predictor-corrector routine. 

Let us first consider the multistep approach. Think about how integrating an 
ODE is different from finding the integral of a function: For a function, the integrand 
has a known dependence on the independent variable x, and can be evaluated at 
will. For an ODE, the “integrand” is the right-hand side, which depends both on 
x and on the dependent variables y. Thus to advance the solution of $y' = f(x,y)$
from $x_n$ to $x$, we have

$$ y(x) = y_n + \int_{x_n}^{x} f(x', y)\,dx' \tag{16.7.1} $$

In a single-step method like Runge-Kutta or Bulirsch-Stoer, the value $y_{n+1}$ at $x_{n+1}$
depends only on $y_n$. In a multistep method, we approximate $f(x,y)$ by a polynomial
passing through several previous points $x_n, x_{n-1}, \ldots$ and possibly also through
$x_{n+1}$. The result of evaluating the integral (16.7.1) at $x = x_{n+1}$ is then of the form

$$ y_{n+1} = y_n + h(\beta_0 y'_{n+1} + \beta_1 y'_n + \beta_2 y'_{n-1} + \beta_3 y'_{n-2} + \cdots) \tag{16.7.2} $$

where $y'_n$ denotes $f(x_n, y_n)$, and so on. If $\beta_0 = 0$, the method is explicit; otherwise
it is implicit. The order of the method depends on how many previous steps we
use to get each new value of y.

Consider how we might solve an implicit formula of the form (16.7.2) for $y_{n+1}$.
Two methods suggest themselves: functional iteration and Newton's method. In
functional iteration, we take some initial guess for $y_{n+1}$, insert it into the right-hand
side of (16.7.2) to get an updated value of $y_{n+1}$, insert this updated value back into
the right-hand side, and continue iterating. But how are we to get an initial guess for
$y_{n+1}$? Easy! Just use some explicit formula of the same form as (16.7.2). This is
called the predictor step. In the predictor step we are essentially extrapolating the
polynomial fit to the derivative from the previous points to the new point $x_{n+1}$ and
then doing the integral (16.7.1) in a Simpson-like manner from $x_n$ to $x_{n+1}$. The
subsequent Simpson-like integration, using the prediction step's value of $y_{n+1}$ to
interpolate the derivative, is called the corrector step. The difference between the
predicted and corrected function values supplies information on the local truncation
error that can be used to control accuracy and to adjust stepsize.

If one corrector step is good, aren’t many better? Why not use each corrector 
as an improved predictor and iterate to convergence on each step? Answer: Even if 
you had a perfect predictor, the step would still be accurate only to the finite order 
of the corrector. This incurable error term is on the same order as that which your 
iteration is supposed to cure, so you are at best changing only the coefficient in front
of the error term by a fractional amount. So dubious an improvement is certainly not
worth the effort. Your extra effort would be better spent in taking a smaller stepsize. 

As described so far, you might think it desirable or necessary to predict several 
intervals ahead at each step, then to use all these intervals, with various weights, in 
a Simpson-like corrector step. That is not a good idea. Extrapolation is the least 
stable part of the procedure, and it is desirable to minimize its effect. Therefore, the 
integration steps of a predictor-corrector method are overlapping, each one involving 
several stepsize intervals h, but extending just one such interval farther than the 
previous ones. Only that one extended interval is extrapolated by each predictor step. 

The most popular predictor-corrector methods are probably the Adams-Bashforth-Moulton
schemes, which have good stability properties. The Adams-Bashforth
part is the predictor. For example, the third-order case is

$$ \text{predictor:} \qquad y_{n+1} = y_n + \frac{h}{12}\left(23y'_n - 16y'_{n-1} + 5y'_{n-2}\right) + O(h^4) \tag{16.7.3} $$

Here information at the current point $x_n$, together with the two previous points $x_{n-1}$
and $x_{n-2}$ (assumed equally spaced), is used to predict the value $y_{n+1}$ at the next
point, $x_{n+1}$. The Adams-Moulton part is the corrector. The third-order case is

$$ \text{corrector:} \qquad y_{n+1} = y_n + \frac{h}{12}\left(5y'_{n+1} + 8y'_n - y'_{n-1}\right) + O(h^4) \tag{16.7.4} $$

Without the trial value of $y_{n+1}$ from the predictor step to insert on the right-hand
side, the corrector would be a nasty implicit equation for $y_{n+1}$.

There are actually three separate processes occurring in a predictor-corrector
method: the predictor step, which we call P, the evaluation of the derivative $y'_{n+1}$
from the latest value of y, which we call E, and the corrector step, which we call
C. In this notation, iterating m times with the corrector (a practice we inveighed
against earlier) would be written P(EC)$^m$. One also has the choice of finishing with
a C or an E step. The lore is that a final E is superior, so the strategy usually
recommended is PECE.

Notice that a PC method with a fixed number of iterations (say, one) is an 
explicit method! When we fix the number of iterations in advance, then the final 
value of $y_{n+1}$ can be written as some complicated function of known quantities. Thus
fixed iteration PC methods lose the strong stability properties of implicit methods 
and should only be used for nonstiff problems. 

For stiff problems we must use an implicit method if we want to avoid having 
tiny stepsizes. (Not all implicit methods are good for stiff problems, but fortunately 
some good ones such as the Gear formulas are known.) We then appear to have two 
choices for solving the implicit equations: functional iteration to convergence, or 
Newton iteration. However, it turns out that for stiff problems functional iteration will 
not even converge unless we use tiny stepsizes, no matter how close our prediction 
is! Thus Newton iteration is usually an essential part of a multistep stiff solver. For 
convergence, Newton’s method doesn’t particularly care what the stepsize is, as long 
as the prediction is accurate enough. 
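The contrast between the two iterations is easy to see on the simplest stiff model problem. The sketch below is an illustration, assuming the scalar equation y' = -lambda*y and a single backward Euler step as the implicit formula (a simpler stand-in for the Gear formulas mentioned above); the function names be_functional and be_newton are hypothetical.

```c
/* One backward Euler step for y' = -lambda*y requires solving
       y = yn - h*lambda*y
   for y. Both routines start from the guess y = yn. */

/* Functional iteration: y <- yn + h*f(y). Converges only if h*lambda < 1. */
double be_functional(double yn, double h, double lambda, int iters)
{
    double y = yn;
    int k;

    for (k = 0; k < iters; k++) y = yn - h*lambda*y;
    return y;
}

/* Newton iteration on g(y) = y - yn + h*lambda*y, with g'(y) = 1 + h*lambda. */
double be_newton(double yn, double h, double lambda, int iters)
{
    double y = yn;
    int k;

    for (k = 0; k < iters; k++) {
        double g = y - yn + h*lambda*y;     /* residual of the implicit equation */
        double dg = 1.0 + h*lambda;         /* its derivative with respect to y */
        y -= g/dg;
    }
    return y;
}
```

With lambda = 1000 and h = 0.01 the error of the functional iteration is multiplied by -10 on every pass, so it diverges, whereas Newton's method lands on the solution yn/(1 + h*lambda) in one pass because the problem is linear. Functional iteration converges only once h is reduced below 1/lambda, which is exactly the tiny stepsize an implicit method was supposed to avoid.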

Multistep methods, as we have described them so far, suffer from two serious 
difficulties when one tries to implement them: 

• Since the formulas require results from equally spaced steps, adjusting
the stepsize is difficult.

• Starting and stopping present problems. For starting, we need the initial
values plus several previous steps to prime the pump. Stopping is a
problem because equal steps are unlikely to land directly on the desired
termination point.

Older implementations of PC methods have various cumbersome ways of 
dealing with these problems. For example, they might use Runge-Kutta to start 
and stop. Changing the stepsize requires considerable bookkeeping to do some 
kind of interpolation procedure. Fortunately both these drawbacks disappear with 
the multivalue approach. 

For multivalue methods the basic data available to the integrator are the first
few terms of the Taylor series expansion of the solution at the current point $x_n$. The
aim is to advance the solution and obtain the expansion coefficients at the next point
$x_{n+1}$. This is in contrast to multistep methods, where the data are the values of
the solution at $x_n, x_{n-1}, \ldots$. We'll illustrate the idea by considering a four-value
method, for which the basic data are

$$
\mathbf{y}_n \equiv
\begin{pmatrix}
y_n \\
h\,y'_n \\
(h^2/2)\,y''_n \\
(h^3/6)\,y'''_n
\end{pmatrix}
\tag{16.7.5}
$$

It is also conventional to scale the derivatives with the powers of $h = x_{n+1} - x_n$ as
shown. Note that here we use the vector notation $\mathbf{y}$ to denote the solution and its
first few derivatives at a point, not the fact that we are solving a system of equations
with many components y.

In terms of the data in (16.7.5), we can approximate the value of the solution
y at some point x:

$$ y(x) = y_n + (x - x_n)\,y'_n + \frac{(x - x_n)^2}{2}\,y''_n + \frac{(x - x_n)^3}{6}\,y'''_n \tag{16.7.6} $$

Set $x = x_{n+1}$ in equation (16.7.6) to get an approximation to $y_{n+1}$. Differentiate
equation (16.7.6) and set $x = x_{n+1}$ to get an approximation to $y'_{n+1}$, and similarly for
$y''_{n+1}$ and $y'''_{n+1}$. Call the resulting approximation $\tilde{\mathbf{y}}_{n+1}$, where the tilde is a reminder
that all we have done so far is a polynomial extrapolation of the solution and its
derivatives; we have not yet used the differential equation. You can easily verify that

$$ \tilde{\mathbf{y}}_{n+1} = \mathbf{B} \cdot \mathbf{y}_n \tag{16.7.7} $$

where the matrix $\mathbf{B}$ is

$$
\mathbf{B} =
\begin{pmatrix}
1 & 1 & 1 & 1 \\
0 & 1 & 2 & 3 \\
0 & 0 & 1 & 3 \\
0 & 0 & 0 & 1
\end{pmatrix}
\tag{16.7.8}
$$

We now write the actual approximation to $\mathbf{y}_{n+1}$ that we will use by adding a
correction to $\tilde{\mathbf{y}}_{n+1}$:

$$ \mathbf{y}_{n+1} = \tilde{\mathbf{y}}_{n+1} + \alpha\,\mathbf{r} \tag{16.7.9} $$

Here $\mathbf{r}$ will be a fixed vector of numbers, in the same way that $\mathbf{B}$ is a fixed matrix.
We fix $\alpha$ by requiring that the differential equation

$$ y'_{n+1} = f(x_{n+1}, y_{n+1}) \tag{16.7.10} $$

be satisfied. The second of the equations in (16.7.9) is

$$ h\,y'_{n+1} = h\,\tilde{y}'_{n+1} + \alpha\,r_2 \tag{16.7.11} $$

and this will be consistent with (16.7.10) provided

$$ r_2 = 1, \qquad \alpha = h f(x_{n+1}, y_{n+1}) - h\,\tilde{y}'_{n+1} \tag{16.7.12} $$

The values of $r_1$, $r_3$, and $r_4$ are free for the inventor of a given four-value method to
choose. Different choices give different orders of method (i.e., through what order
in h the final expression (16.7.9) actually approximates the solution), and different
stability properties.

An interesting result, not obvious from our presentation, is that multivalue and
multistep methods are entirely equivalent. In other words, the value $y_{n+1}$ given by
a multivalue method with given $\mathbf{B}$ and $\mathbf{r}$ is exactly the same value given by some
multistep method with given $\beta$'s in equation (16.7.2). For example, it turns out
that the Adams-Bashforth formula (16.7.3) corresponds to a four-value method with
$r_1 = 0$, $r_3 = 3/4$, and $r_4 = 1/6$. The method is explicit because $r_1 = 0$. The
Adams-Moulton method (16.7.4) corresponds to the implicit four-value method with
$r_1 = 5/12$, $r_3 = 3/4$, and $r_4 = 1/6$. Implicit multivalue methods are solved the
same way as implicit multistep methods: either by a predictor-corrector approach
using an explicit method for the predictor, or by Newton iteration for stiff systems.

Why go to all the trouble of introducing a whole new method that turns out 
to be equivalent to a method you already knew? The reason is that multivalue 
methods allow an easy solution to the two difficulties we mentioned above in actually 
implementing multistep methods. 

Consider first the question of stepsize adjustment. To change stepsize from $h$
to $h'$ at some point $x_n$, simply multiply the components of $\mathbf{y}_n$ in (16.7.5) by the
appropriate powers of $h'/h$, and you are ready to continue to $x_n + h'$.
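Both operations on the four-value data vector, the polynomial extrapolation (16.7.7) and the stepsize rescaling just described, amount to a few multiplications. The following sketch is an illustration only: the function names are hypothetical, arrays are 0-based, and B is taken to be the upper-triangular matrix of binomial coefficients, which is what differentiating the cubic (16.7.6) term by term gives.

```c
/* Extrapolate the scaled data (y, h y', h^2/2 y'', h^3/6 y''') from x_n to
   x_n + h: ytilde = B . y, with B the matrix of binomial coefficients. */
void nordsieck_predict(const double y[4], double ytilde[4])
{
    static const double B[4][4] = {
        {1.0, 1.0, 1.0, 1.0},
        {0.0, 1.0, 2.0, 3.0},
        {0.0, 0.0, 1.0, 3.0},
        {0.0, 0.0, 0.0, 1.0}
    };
    int i, j;

    for (i = 0; i < 4; i++) {
        ytilde[i] = 0.0;
        for (j = 0; j < 4; j++) ytilde[i] += B[i][j]*y[j];
    }
}

/* Change the stepsize from h to hnew in place: component i is multiplied by
   (hnew/h) raised to the power i. */
void nordsieck_rescale(double y[4], double h, double hnew)
{
    double ratio = hnew/h, factor = 1.0;
    int i;

    for (i = 0; i < 4; i++) {
        y[i] *= factor;
        factor *= ratio;
    }
}
```

As a check, for y(x) = x^3 at x_n = 1 with h = 0.5 the data vector is (1, 1.5, 0.75, 0.125); the prediction reproduces the exact scaled values (3.375, 3.375, 1.125, 0.125) at x = 1.5, as it must for any cubic, and rescaling the original vector to h' = 1 gives (1, 3, 3, 1).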

Multivalue methods also allow a relatively easy change in the order of the 
method: Simply change r. The usual strategy for this is first to determine the new 
stepsize with the current order from the error estimate. Then check what stepsize 
would be predicted using an order one greater and one smaller than the current 
order. Choose the order that allows you to take the biggest next step. Being able to 
change order also allows an easy solution to the starting problem: Simply start with 
a first-order method and let the order automatically increase to the appropriate level. 

For low accuracy requirements, a Runge-Kutta routine like rkqs is almost 
always the most efficient choice. For high accuracy, bsstep is both robust and 
efficient. For very smooth functions, a variable-order PC method can invoke very 
high orders. If the right-hand side of the equation is relatively complicated, so that 
the expense of evaluating it outweighs the bookkeeping expense, then the best PC 
packages can outperform Bulirsch-Stoer on such problems. As you can imagine, 
however, such a variable-stepsize, variable-order method is not trivial to program. If
you suspect that your problem is suitable for this treatment, we recommend use of a
canned PC package. For further details consult Gear [1] or Shampine and Gordon [2].
Our prediction, nevertheless, is that, as extrapolation methods like Bulirsch- 
Stoer continue to gain sophistication, they will eventually beat out PC methods in 
all applications. We are willing, however, to be corrected. 

CITED REFERENCES AND FURTHER READING: 

Gear, C.W. 1971, Numerical Initial Value Problems in Ordinary Differential Equations (Englewood
Cliffs, NJ: Prentice-Hall), Chapter 9. [1]

Shampine, L.F., and Gordon, M.K. 1975, Computer Solution of Ordinary Differential Equations.
The Initial Value Problem. (San Francisco: W.H. Freeman). [2]

Acton, F.S. 1970, Numerical Methods That Work; 1990, corrected edition (Washington:
Mathematical Association of America), Chapter 5.

Kahaner, D., Moler, C., and Nash, S. 1989, Numerical Methods and Software (Englewood Cliffs,
NJ: Prentice Hall), Chapter 8.

Hamming, R.W. 1962, Numerical Methods for Engineers and Scientists; reprinted 1986 (New
York: Dover), Chapters 14-15.

Stoer, J., and Bulirsch, R. 1980, Introduction to Numerical Analysis (New York: Springer-Verlag),
Chapter 7.

17.0 Introduction


When ordinary differential equations are required to satisfy boundary conditions 
at more than one value of the independent variable, the resulting problem is called a 
two point boundary value problem. As the terminology indicates, the most common 
case by far is where boundary conditions are supposed to be satisfied at two points — 
usually the starting and ending values of the integration. However, the phrase “two 
point boundary value problem” is also used loosely to include more complicated 
cases, e.g., where some conditions are specified at endpoints, others at interior 
(usually singular) points. 

The crucial distinction between initial value problems (Chapter 16) and two 
point boundary value problems (this chapter) is that in the former case we are able 
to start an acceptable solution at its beginning (initial values) and just march it along 
by numerical integration to its end (final values); while in the present case, the 
boundary conditions at the starting point do not determine a unique solution to start 
with — and a “random” choice among the solutions that satisfy these (incomplete) 
starting boundary conditions is almost certain not to satisfy the boundary conditions 
at the other specified point(s). 

It should not surprise you that iteration is in general required to meld these 
spatially scattered boundary conditions into a single global solution of the differential 
equations. For this reason, two point boundary value problems require considerably 
more effort to solve than do initial value problems. You have to integrate your differential
equations over the interval of interest, or perform an analogous “relaxation”
procedure (see below), at least several, and sometimes very many, times. Only in 
the special case of linear differential equations can you say in advance just how 
many such iterations will be required. 

The “standard” two point boundary value problem has the following form: We 
desire the solution to a set of N coupled first-order ordinary differential equations, 
satisfying $n_1$ boundary conditions at the starting point $x_1$, and a remaining set of
$n_2 = N - n_1$ boundary conditions at the final point $x_2$. (Recall that all differential
equations of order higher than first can be written as coupled sets of first-order
equations, cf. §16.0.)

The differential equations are

$$\frac{dy_i(x)}{dx} = g_i(x, y_1, y_2, \ldots, y_N) \qquad i = 1, 2, \ldots, N \tag{17.0.1}$$

Figure 17.0.1. Shooting method (schematic). Trial integrations that satisfy the boundary condition at one
endpoint are “launched.” The discrepancies from the desired boundary condition at the other endpoint are
used to adjust the starting conditions, until boundary conditions at both endpoints are ultimately satisfied.
(Figure labels: “required boundary value”; horizontal axis x.)

At $x_1$, the solution is supposed to satisfy

$$B_{1j}(x_1, y_1, y_2, \ldots, y_N) = 0 \qquad j = 1, \ldots, n_1 \tag{17.0.2}$$

while at $x_2$, it is supposed to satisfy

$$B_{2k}(x_2, y_1, y_2, \ldots, y_N) = 0 \qquad k = 1, \ldots, n_2 \tag{17.0.3}$$

There are two distinct classes of numerical methods for solving two point 
boundary value problems. In the shooting method (§17.1) we choose values for all 
of the dependent variables at one boundary. These values must be consistent with 
any boundary conditions for that boundary, but otherwise are arranged to depend 
on arbitrary free parameters whose values we initially “randomly” guess. We then 
integrate the ODEs by initial value methods, arriving at the other boundary (and/or any 
interior points with boundary conditions specified). In general, we find discrepancies 
from the desired boundary values there. Now we have a multidimensional root-finding
problem, as was treated in §9.6 and §9.7: Find the adjustment of the free
parameters at the starting point that zeros the discrepancies at the other boundary 
point(s). If we liken integrating the differential equations to following the trajectory 
of a shot from gun to target, then picking the initial conditions corresponds to aiming 
(see Figure 17.0.1). The shooting method provides a systematic approach to taking 
a set of “ranging” shots that allow us to improve our “aim” systematically. 

As another variant of the shooting method (§17.2), we can guess unknown free 
parameters at both ends of the domain, integrate the equations to a common midpoint, 
and seek to adjust the guessed parameters so that the solution joins “smoothly” at 
the fitting point. In all shooting methods, trial solutions satisfy the differential 
equations “exactly” (or as exactly as we care to make our numerical integration), 
but the trial solutions come to satisfy the required boundary conditions only after 
the iterations are finished. 

Relaxation methods use a different approach. The differential equations are
replaced by finite-difference equations on a mesh of points that covers the range of
the integration. A trial solution consists of values for the dependent variables at each
mesh point, not satisfying the desired finite-difference equations, nor necessarily even
satisfying the required boundary conditions. The iteration, now called relaxation,
consists of adjusting all the values on the mesh so as to bring them into successively
closer agreement with the finite-difference equations and, simultaneously, with the
boundary conditions (see Figure 17.0.2). For example, if the problem involves three
coupled equations and a mesh of one hundred points, we must guess and improve
three hundred variables representing the solution.

Figure 17.0.2. Relaxation method (schematic). An initial solution is guessed that approximately satisfies
the differential equation and boundary conditions. An iterative process adjusts the function to bring it
into close agreement with the true solution.

With all this adjustment, you may be surprised that relaxation is ever an efficient 
method, but (for the right problems) it really is! Relaxation works better than 
shooting when the boundary conditions are especially delicate or subtle, or where 
they involve complicated algebraic relations that cannot easily be solved in closed 
form. Relaxation works best when the solution is smooth and not highly oscillatory. 
Such oscillations would require many grid points for accurate representation. The 
number and position of required points may not be known a priori. Shooting methods 
are usually preferred in such cases, because their variable stepsize integrations adjust 
naturally to a solution’s peculiarities. 

Relaxation methods are often preferred when the ODEs have extraneous 
solutions which, while not appearing in the final solution satisfying all boundary 
conditions, may wreak havoc on the initial value integrations required by shooting. 
The typical case is that of trying to maintain a dying exponential in the presence 
of growing exponentials. 

Good initial guesses are the secret of efficient relaxation methods. Often one 
has to solve a problem many times, each time with a slightly different value of some 
parameter. In that case, the previous solution is usually a good initial guess when 
the parameter is changed, and relaxation will work well. 

Until you have enough experience to make your own judgment between the two 
methods, you might wish to follow the advice of your authors, who are notorious 
computer gunslingers: We always shoot first, and only then relax.

Problems Reducible to the Standard Boundary Problem

There are two important problems that can be reduced to the standard boundary 
value problem described by equations (17.0.1) - (17.0.3). The first is the eigenvalue 
problem for differential equations. Here the right-hand side of the system of 
differential equations depends on a parameter $\lambda$,

$$\frac{dy_i(x)}{dx} = g_i(x, y_1, \ldots, y_N, \lambda) \tag{17.0.4}$$

and one has to satisfy $N + 1$ boundary conditions instead of just $N$. The problem
is overdetermined and in general there is no solution for arbitrary values of $\lambda$. For
certain special values of $\lambda$, the eigenvalues, equation (17.0.4) does have a solution.

We reduce this problem to the standard case by introducing a new dependent
variable

$$y_{N+1} \equiv \lambda \tag{17.0.5}$$

and another differential equation

$$\frac{dy_{N+1}}{dx} = 0 \tag{17.0.6}$$

An example of this trick is given in §17.4. 
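
As a small illustration of (17.0.5) and (17.0.6), here is a minimal sketch of a derivative routine in which the eigenvalue rides along as an extra dependent variable. The test equation $y'' + \lambda y = 0$, the name derivs_eigen, the macros, the use of double, and the 0-based indexing are all assumptions made for this sketch; they are not the routines of this book, which index from 1 and use float.

```c
#define N_ORIG 2              /* N: original number of first-order ODEs */
#define N_AUG  (N_ORIG + 1)   /* N + 1 after appending y_{N+1} = lambda */

/* Right-hand sides for y'' + lambda*y = 0, written as
   y[0]' = y[1] and y[1]' = -lambda*y[0], with lambda carried in y[2]. */
void derivs_eigen(double x, const double y[], double dydx[])
{
    double lambda = y[N_AUG - 1];   /* equation (17.0.5) */
    (void)x;                        /* no explicit x dependence in this example */
    dydx[0] = y[1];
    dydx[1] = -lambda * y[0];
    dydx[N_AUG - 1] = 0.0;          /* equation (17.0.6): lambda is constant */
}
```

With this layout the integrator sees an ordinary system of three equations, and the eigenvalue becomes one more unknown starting value to be adjusted along with the others.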

The other case that can be put in the standard form is a free boundary problem. 
Here only one boundary abscissa $x_1$ is specified, while the other boundary $x_2$ is to
be determined so that the system (17.0.1) has a solution satisfying a total of $N + 1$
boundary conditions. Here we again add an extra constant dependent variable:

$$y_{N+1} \equiv x_2 - x_1 \tag{17.0.7}$$

$$\frac{dy_{N+1}}{dx} = 0 \tag{17.0.8}$$

We also define a new independent variable $t$ by setting

$$x - x_1 \equiv t\, y_{N+1}, \qquad 0 \le t \le 1 \tag{17.0.9}$$

The system of $N + 1$ differential equations for $dy_i/dt$ is now in the standard form,
with $t$ varying between the known limits 0 and 1.
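
It may help to write the transformed system out explicitly. The following is only the chain rule applied to (17.0.1) with the change of variable (17.0.9), under which $dx/dt = y_{N+1}$; it is a worked step, not an additional result.

```latex
\frac{dy_i}{dt} = \frac{dx}{dt}\,\frac{dy_i}{dx}
               = y_{N+1}\; g_i\bigl(x_1 + t\,y_{N+1},\; y_1, \ldots, y_N\bigr),
\qquad i = 1, \ldots, N,
\qquad\qquad
\frac{dy_{N+1}}{dt} = 0 .
```

The unknown interval length thus appears only as a multiplicative factor on the original right-hand sides and inside their first argument.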


CITED REFERENCES AND FURTHER READING: 

Keller, H.B. 1968, Numerical Methods for Two-Point Boundary-Value Problems (Waltham, MA: 
Blaisdell). 

Kippenhan, R., Weigert, A., and Hofmeister, E. 1968, in Methods in Computational Physics, 
vol. 7 (New York: Academic Press), pp. 129ff. 

Eggleton, P.P. 1971, Monthly Notices of the Royal Astronomical Society, vol. 151, pp. 351-364. 
London, R.A., and Flannery, B.P. 1982, Astrophysical Journal, vol. 258, pp. 260-269. 

Stoer, J., and Bulirsch, R. 1980, Introduction to Numerical Analysis (New York: Springer-Verlag),
§§7.3-7.4.

17.1 The Shooting Method

In this section we discuss “pure” shooting, where the integration proceeds from
$x_1$ to $x_2$, and we try to match boundary conditions at the end of the integration. In
the next section, we describe shooting to an intermediate fitting point, where the 
solution to the equations and boundary conditions is found by launching “shots” 
from both sides of the interval and trying to match continuity conditions at some 
intermediate point. 

Our implementation of the shooting method exactly implements multidimensional,
globally convergent Newton-Raphson (§9.7). It seeks to zero $n_2$ functions
of $n_2$ variables. The functions are obtained by integrating $N$ differential equations
from $x_1$ to $x_2$. Let us see how this works:

At the starting point $x_1$ there are $N$ starting values $y_i$ to be specified, but
subject to $n_1$ conditions. Therefore there are $n_2 = N - n_1$ freely specifiable starting
values. Let us imagine that these freely specifiable values are the components of a
vector $\mathbf{V}$ that lives in a vector space of dimension $n_2$. Then you, the user, knowing
the functional form of the boundary conditions (17.0.2), can write a function that
generates a complete set of $N$ starting values $\mathbf{y}$, satisfying the boundary conditions
at $x_1$, from an arbitrary vector value of $\mathbf{V}$ in which there are no restrictions on the $n_2$
component values. In other words, (17.0.2) converts to a prescription

$$y_i(x_1) = y_i(x_1; V_1, \ldots, V_{n_2}) \qquad i = 1, \ldots, N \tag{17.1.1}$$

Below, the function that implements (17.1.1) will be called load. 

Notice that the components of V might be exactly the values of certain “free” 
components of y, with the other components of y determined by the boundary 
conditions. Alternatively, the components of V might parametrize the solutions that 
satisfy the starting boundary conditions in some other convenient way. Boundary 
conditions often impose algebraic relations among the $y_i$, rather than specific values
for each of them. Using some auxiliary set of parameters often makes it easier to
“solve” the boundary relations for a consistent set of $y_i$'s. It makes no difference
which way you go, as long as your vector space of $\mathbf{V}$'s generates (through 17.1.1)
all allowed starting vectors y. 

Given a particular $\mathbf{V}$, a particular $\mathbf{y}(x_1)$ is thus generated. It can then be turned
into a $\mathbf{y}(x_2)$ by integrating the ODEs to $x_2$ as an initial value problem (e.g., using
Chapter 16's odeint). Now, at $x_2$, let us define a discrepancy vector $\mathbf{F}$, also of
dimension $n_2$, whose components measure how far we are from satisfying the $n_2$
boundary conditions at $x_2$ (17.0.3). Simplest of all is just to use the left-hand
sides of (17.0.3),

$$F_k = B_{2k}(x_2, \mathbf{y}) \qquad k = 1, \ldots, n_2 \tag{17.1.2}$$

As in the case of V, however, you can use any other convenient parametrization, 
as long as your space of F’s spans the space of possible discrepancies from the 
desired boundary conditions, with all components of F equal to zero if and only if 
the boundary conditions at $x_2$ are satisfied. Below, you will be asked to supply a
user-written function score which uses (17.0.3) to convert an $N$-vector of ending
values $\mathbf{y}(x_2)$ into an $n_2$-vector of discrepancies $\mathbf{F}$.

Now, as far as Newton-Raphson is concerned, we are nearly in business. We
want to find a vector value of $\mathbf{V}$ that zeros the vector value of $\mathbf{F}$. We do this
by invoking the globally convergent Newton's method implemented in the routine
newt of §9.7. Recall that the heart of Newton's method involves solving the set
of $n_2$ linear equations

$$\mathbf{J} \cdot \delta\mathbf{V} = -\mathbf{F} \tag{17.1.3}$$

and then adding the correction back,

$$\mathbf{V}^{\mathrm{new}} = \mathbf{V}^{\mathrm{old}} + \delta\mathbf{V} \tag{17.1.4}$$

In (17.1.3), the Jacobian matrix $\mathbf{J}$ has components given by

$$J_{ij} = \frac{\partial F_i}{\partial V_j} \tag{17.1.5}$$

It is not feasible to compute these partial derivatives analytically. Rather, each
requires a separate integration of the $N$ ODEs, followed by the evaluation of

$$\frac{\partial F_i}{\partial V_j} \approx \frac{F_i(V_1, \ldots, V_j + \Delta V_j, \ldots) - F_i(V_1, \ldots, V_j, \ldots)}{\Delta V_j} \tag{17.1.6}$$

This is done automatically for you in the routine fdjac that comes with newt. The
only input to newt that you have to provide is the routine vecfunc that calculates
$\mathbf{F}$ by integrating the ODEs. Here is the appropriate routine, called shoot, that is
to be passed as the actual argument in newt:


#include "nrutil.h"
#define EPS 1.0e-6

extern int nvar;            /* Variables that you must define and set in your main program. */
extern float x1,x2;

int kmax,kount;             /* Communicates with odeint. */
float *xp,**yp,dxsav;

void shoot(int n, float v[], float f[])
/* Routine for use with newt to solve a two point boundary value problem for nvar coupled ODEs
by shooting from x1 to x2. Initial values for the nvar ODEs at x1 are generated from the n2
input coefficients v[1..n2], using the user-supplied routine load. The routine integrates the
ODEs to x2 using the Runge-Kutta method with tolerance EPS, initial stepsize h1, and minimum
stepsize hmin. At x2 it calls the user-supplied routine score to evaluate the n2 functions
f[1..n2] that ought to be zero to satisfy the boundary conditions at x2. The functions f
are returned on output. newt uses a globally convergent Newton's method to adjust the values
of v until the functions f are zero. The user-supplied routine derivs(x,y,dydx) supplies
derivative information to the ODE integrator (see Chapter 16). The first set of global variables
above receives its values from the main program so that shoot can have the syntax required
for it to be the argument vecfunc of newt. */
{
    void derivs(float x, float y[], float dydx[]);
    void load(float x1, float v[], float y[]);
    void odeint(float ystart[], int nvar, float x1, float x2,
        float eps, float h1, float hmin, int *nok, int *nbad,
        void (*derivs)(float, float [], float []),
        void (*rkqs)(float [], float [], int, float *, float, float,
        float [], float *, float *, void (*)(float, float [], float [])));
    void rkqs(float y[], float dydx[], int n, float *x,
        float htry, float eps, float yscal[], float *hdid, float *hnext,
        void (*derivs)(float, float [], float []));
    void score(float xf, float y[], float f[]);
    int nbad,nok;
    float h1,hmin=0.0,*y;

    y=vector(1,nvar);
    kmax=0;                 /* No intermediate results are stored by odeint. */
    h1=(x2-x1)/100.0;
    load(x1,v,y);           /* Starting values at x1 from the trial parameters v. */
    odeint(y,nvar,x1,x2,EPS,h1,hmin,&nok,&nbad,derivs,rkqs);
    score(x2,y,f);          /* Discrepancies from the boundary conditions at x2. */
    free_vector(y,1,nvar);
}


For some problems the initial stepsize $\Delta V$ might depend sensitively upon the
initial conditions. It is straightforward to alter load to include a suggested stepsize
h1 as another output variable and feed it to fdjac via a global variable.
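
The forward-difference estimate (17.1.6) can be sketched in a few lines. What follows is a minimal illustration under stated assumptions, not the fdjac routine of §9.7: the names fd_jacobian, vecfunc_t, and example_func are hypothetical, arrays are 0-based, arithmetic is in double, and example_func is a cheap algebraic stand-in for a routine like shoot so that the sketch can be run without an ODE integrator.

```c
#include <stdlib.h>
#include <assert.h>

/* Signature modeled loosely on the vecfunc argument of newt (0-based, double). */
typedef void (*vecfunc_t)(int n, const double v[], double f[]);

/* Fill jac[i*n + j] with an estimate of dF_i/dV_j, as in equation (17.1.6).
   f0 must already hold F(v); eps sets the relative size of each increment. */
void fd_jacobian(int n, double v[], const double f0[], double jac[],
                 vecfunc_t func, double eps)
{
    double *f1 = malloc((size_t)n * sizeof *f1);
    int i, j;
    assert(f1 != NULL);
    for (j = 0; j < n; j++) {
        double saved = v[j];
        double dv = eps * (saved < 0.0 ? -saved : saved);
        if (dv == 0.0) dv = eps;
        v[j] = saved + dv;
        dv = v[j] - saved;      /* the increment actually taken in floating point */
        func(n, v, f1);         /* in shooting, this is one full ODE integration */
        v[j] = saved;           /* restore the parameter before the next column */
        for (i = 0; i < n; i++)
            jac[i * n + j] = (f1[i] - f0[i]) / dv;
    }
    free(f1);
}

/* Hypothetical stand-in for shoot: F_0 = V_0^2 + V_1, F_1 = V_0 * V_1. */
void example_func(int n, const double v[], double f[])
{
    (void)n;
    f[0] = v[0] * v[0] + v[1];
    f[1] = v[0] * v[1];
}
```

Each column of the Jacobian costs one call of func, which is why the increment size and the cost of an integration both matter in practice.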

A complete cycle of the shooting method thus requires $n_2 + 1$ integrations of
the $N$ coupled ODEs: one integration to evaluate the current degree of mismatch,
and $n_2$ for the partial derivatives. Each new cycle requires a new round of $n_2 + 1$
integrations. This illustrates the enormous extra effort involved in solving two point
boundary value problems compared with initial value problems.

If the differential equations are linear, then only one complete cycle is required,
since (17.1.3)-(17.1.4) should take us right to the solution. A second round can be
useful, however, in mopping up some (never all) of the roundoff error.
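
A self-contained toy version of one shooting cycle may make the bookkeeping concrete. Everything in this sketch is an assumption chosen for illustration: the linear test problem $y'' = 6x$ with $y(0) = 0$ and $y(1) = 2$ (exact solution $y = x^3 + x$, so the free slope is $y'(0) = 1$), a fixed-step classical Runge-Kutta integrator standing in for odeint, a scalar Newton step standing in for newt, 0-based arrays, and double arithmetic. Here $N = 2$ and $n_1 = n_2 = 1$.

```c
#define NVAR 2              /* N: number of first-order ODEs */

int n_integrations = 0;     /* counts trial integrations, for illustration only */

/* y'' = 6x written as y[0]' = y[1], y[1]' = 6x. */
void derivs(double x, const double y[], double dydx[])
{
    dydx[0] = y[1];
    dydx[1] = 6.0 * x;
}

/* Starting values at x1 satisfying y(x1) = 0; v[0] is the free slope. */
void load(double x1, const double v[], double y[])
{
    (void)x1;
    y[0] = 0.0;
    y[1] = v[0];
}

/* Discrepancy from the boundary condition y(x2) = 2. */
void score(double x2, const double y[], double f[])
{
    (void)x2;
    f[0] = y[0] - 2.0;
}

/* Fixed-step classical Runge-Kutta from xa to xb (stand-in for odeint). */
void rk4_path(double y[], double xa, double xb, int nstep)
{
    double h = (xb - xa) / nstep;
    double k1[NVAR], k2[NVAR], k3[NVAR], k4[NVAR], yt[NVAR];
    int i, s;
    for (s = 0; s < nstep; s++) {
        double x = xa + s * h;
        derivs(x, y, k1);
        for (i = 0; i < NVAR; i++) yt[i] = y[i] + 0.5 * h * k1[i];
        derivs(x + 0.5 * h, yt, k2);
        for (i = 0; i < NVAR; i++) yt[i] = y[i] + 0.5 * h * k2[i];
        derivs(x + 0.5 * h, yt, k3);
        for (i = 0; i < NVAR; i++) yt[i] = y[i] + h * k3[i];
        derivs(x + h, yt, k4);
        for (i = 0; i < NVAR; i++)
            y[i] += h * (k1[i] + 2.0 * k2[i] + 2.0 * k3[i] + k4[i]) / 6.0;
    }
    n_integrations++;
}

/* F(V): load the start values, integrate to x2, score the mismatch. */
double shoot_f(double v, double x1, double x2)
{
    double y[NVAR], f[1];
    load(x1, &v, y);
    rk4_path(y, x1, x2, 100);
    score(x2, y, f);
    return f[0];
}

/* One Newton cycle with a forward-difference derivative: n2 + 1 = 2 integrations. */
double newton_cycle(double v, double x1, double x2)
{
    double dv = 1.0e-4 * (v < 0.0 ? -v : v);
    double f0, f1;
    if (dv == 0.0) dv = 1.0e-4;
    f0 = shoot_f(v, x1, x2);
    f1 = shoot_f(v + dv, x1, x2);
    return v - f0 * dv / (f1 - f0);
}
```

Because this test problem is linear, a single call such as newton_cycle(5.0, 0.0, 1.0) should already land on the slope 1 to within roundoff; a nonlinear problem would need the cycle repeated, with the safeguards that a globally convergent method provides.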

As given here, shoot uses the quality-controlled Runge-Kutta method of §16.2
to integrate the ODEs, but any of the other methods of Chapter 16 could just as 
well be used. 

You, the user, must supply shoot with: (i) a function load(x1,v,y) which
calculates the n-vector y[1..n] (satisfying the starting boundary conditions, of
course), given the freely specifiable variables of v[1..n2] at the initial point x1;
(ii) a function score(x2,y,f) which calculates the discrepancy vector f[1..n2]
of the ending boundary conditions, given the vector y[1..n] at the endpoint x2;
(iii) a starting vector v[1..n2]; (iv) a function derivs for the ODE integration; and
other obvious parameters as described in the header comment above.

In §17.4 we give a sample program illustrating how to use shoot. 


CITED REFERENCES AND FURTHER READING: 

Acton, F.S. 1970, Numerical Methods That Work; 1990, corrected edition (Washington: Mathematical
Association of America).

Keller, H.B. 1968, Numerical Methods for Two-Point Boundary-Value Problems (Waltham, MA:
Blaisdell).

17.2 Shooting to a Fitting Point

The shooting method described in §17.1 tacitly assumed that the “shots” would 
be able to traverse the entire domain of integration, even at the early stages of 
convergence to a correct solution. In some problems it can happen that, for very
wrong starting conditions, an initial solution can’t even get from $x_1$ to $x_2$ without
encountering some incalculable, or catastrophic, result. For example, the argument 
of a square root might go negative, causing the numerical code to crash. Simple 
shooting would be stymied. 

A different, but related, case is where the endpoints are both singular points 
of the set of ODEs. One frequently needs to use special methods to integrate near 
the singular points, analytic asymptotic expansions, for example. In such cases it is 
feasible to integrate in the direction away from a singular point, using the special 
method to get through the first little bit and then reading off “initial” values for 
further numerical integration. However it is usually not feasible to integrate into 
a singular point, if only because one has not usually expended the same analytic 
effort to obtain expansions of “wrong” solutions near the singular point (those not 
satisfying the desired boundary condition). 

The solution to the above mentioned difficulties is shooting to a fitting point. 
Instead of integrating from $x_1$ to $x_2$, we integrate first from $x_1$ to some point $x_f$ that
is between $x_1$ and $x_2$; and second from $x_2$ (in the opposite direction) to $x_f$.

If (as before) the number of boundary conditions imposed at $x_1$ is $n_1$, and the
number imposed at $x_2$ is $n_2$, then there are $n_2$ freely specifiable starting values at
$x_1$ and $n_1$ freely specifiable starting values at $x_2$. (If you are confused by this, go
back to §17.1.) We can therefore define an $n_2$-vector $\mathbf{V}_{(1)}$ of starting parameters
at $x_1$, and a prescription load1(x1,v1,y) for mapping $\mathbf{V}_{(1)}$ into a $\mathbf{y}$ that satisfies
the boundary conditions at $x_1$,

$$y_i(x_1) = y_i(x_1; V_{(1)1}, \ldots, V_{(1)n_2}) \qquad i = 1, \ldots, N \tag{17.2.1}$$

Likewise we can define an $n_1$-vector $\mathbf{V}_{(2)}$ of starting parameters at $x_2$, and a
prescription load2(x2,v2,y) for mapping $\mathbf{V}_{(2)}$ into a $\mathbf{y}$ that satisfies the boundary
conditions at $x_2$,

$$y_i(x_2) = y_i(x_2; V_{(2)1}, \ldots, V_{(2)n_1}) \qquad i = 1, \ldots, N \tag{17.2.2}$$

We thus have a total of $N$ freely adjustable parameters in the combination of
$\mathbf{V}_{(1)}$ and $\mathbf{V}_{(2)}$. The $N$ conditions that must be satisfied are that there be agreement
in $N$ components of $\mathbf{y}$ at $x_f$ between the values obtained integrating from one side
and from the other,

$$y_i(x_f; \mathbf{V}_{(1)}) = y_i(x_f; \mathbf{V}_{(2)}) \qquad i = 1, \ldots, N \tag{17.2.3}$$

In some problems, the $N$ matching conditions can be better described (physically,
mathematically, or numerically) by using $N$ different functions $F_i$, $i = 1, \ldots, N$, each
possibly depending on the $N$ components $y_i$. In those cases, (17.2.3) is replaced by

$$F_i[\mathbf{y}(x_f; \mathbf{V}_{(1)})] = F_i[\mathbf{y}(x_f; \mathbf{V}_{(2)})] \qquad i = 1, \ldots, N \tag{17.2.4}$$

In the program below, the user-supplied function score(xf,y,f) is supposed
to map an input $N$-vector $\mathbf{y}$ into an output $N$-vector $\mathbf{F}$. In most cases, you can
dummy this function as the identity mapping.

Shooting to a fitting point uses globally convergent Newton-Raphson exactly
as in §17.1. Comparing closely with the routine shoot of the previous section, you
should have no difficulty in understanding the following routine shootf. The main
differences in use are that you have to supply both load1 and load2. Also, in the
calling program you must supply initial guesses for v1[1..n2] and v2[1..n1].
Once again a sample program illustrating shooting to a fitting point is given in §17.4.

#include "nrutil.h"
#define EPS 1.0e-6

extern int nn2,nvar;        /* Variables that you must define and set in your main program. */
extern float x1,x2,xf;

int kmax,kount;             /* Communicates with odeint. */
float *xp,**yp,dxsav;

void shootf(int n, float v[], float f[])
/* Routine for use with newt to solve a two point boundary value problem for nvar coupled
ODEs by shooting from x1 and x2 to a fitting point xf. Initial values for the nvar ODEs at
x1 (x2) are generated from the n2 (n1) coefficients v1 (v2), using the user-supplied routine
load1 (load2). The coefficients v1 and v2 should be stored in a single array v[1..n1+n2]
in the main program by statements of the form v1=v; and v2 = &v[n2];. The input parameter
n = n1 + n2 = nvar. The routine integrates the ODEs to xf using the Runge-Kutta
method with tolerance EPS, initial stepsize h1, and minimum stepsize hmin. At xf it calls the
user-supplied routine score to evaluate the nvar functions f1 and f2 that ought to match
at xf. The differences f are returned on output. newt uses a globally convergent Newton's
method to adjust the values of v until the functions f are zero. The user-supplied routine
derivs(x,y,dydx) supplies derivative information to the ODE integrator (see Chapter 16).
The first set of global variables above receives its values from the main program so that shootf
can have the syntax required for it to be the argument vecfunc of newt. Set nn2 = n2 in
the main program. */
{
    void derivs(float x, float y[], float dydx[]);
    void load1(float x1, float v1[], float y[]);
    void load2(float x2, float v2[], float y[]);
    void odeint(float ystart[], int nvar, float x1, float x2,
        float eps, float h1, float hmin, int *nok, int *nbad,
        void (*derivs)(float, float [], float []),
        void (*rkqs)(float [], float [], int, float *, float, float,
        float [], float *, float *, void (*)(float, float [], float [])));
    void rkqs(float y[], float dydx[], int n, float *x,
        float htry, float eps, float yscal[], float *hdid, float *hnext,
        void (*derivs)(float, float [], float []));
    void score(float xf, float y[], float f[]);
    int i,nbad,nok;
    float h1,hmin=0.0,*f1,*f2,*y;

    f1=vector(1,nvar);
    f2=vector(1,nvar);
    y=vector(1,nvar);
    kmax=0;
    h1=(x2-x1)/100.0;
    load1(x1,v,y);          /* Path from x1 to xf with best trial values v1. */
    odeint(y,nvar,x1,xf,EPS,h1,hmin,&nok,&nbad,derivs,rkqs);
    score(xf,y,f1);
    load2(x2,&v[nn2],y);    /* Path from x2 to xf with best trial values v2. */
    odeint(y,nvar,x2,xf,EPS,h1,hmin,&nok,&nbad,derivs,rkqs);
    score(xf,y,f2);
    for (i=1;i<=n;i++) f[i]=f1[i]-f2[i];
    free_vector(y,1,nvar);
    free_vector(f2,1,nvar);
    free_vector(f1,1,nvar);
}

There are boundary value problems where even shooting to a fitting point fails:
the integration interval has to be partitioned by several fitting points with the
solution being matched at each such point. For more details see [1].
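
Returning to the single fitting point of (17.2.3), a toy two-sided version can be sketched as follows. All specifics here are assumptions made for illustration: the linear test problem $y'' = 6x$ with $y(0) = 0$ and $y(1) = 2$ (exact solution $y = x^3 + x$, so the free slopes are $y'(0) = 1$ and $y'(1) = 4$), the fitting point $x_f = 0.5$, a fixed-step Runge-Kutta integrator in place of odeint, a hand-written 2 x 2 Newton step in place of newt, score taken as the identity, 0-based arrays, and double arithmetic.

```c
#define NVAR 2   /* N first-order ODEs; here n1 = n2 = 1 */

/* y'' = 6x written as y[0]' = y[1], y[1]' = 6x. */
void derivs(double x, const double y[], double dydx[])
{
    dydx[0] = y[1];
    dydx[1] = 6.0 * x;
}

/* Fixed-step classical Runge-Kutta from xa to xb; xb < xa integrates backward. */
void rk4_path(double y[], double xa, double xb, int nstep)
{
    double h = (xb - xa) / nstep;
    double k1[NVAR], k2[NVAR], k3[NVAR], k4[NVAR], yt[NVAR];
    int i, s;
    for (s = 0; s < nstep; s++) {
        double x = xa + s * h;
        derivs(x, y, k1);
        for (i = 0; i < NVAR; i++) yt[i] = y[i] + 0.5 * h * k1[i];
        derivs(x + 0.5 * h, yt, k2);
        for (i = 0; i < NVAR; i++) yt[i] = y[i] + 0.5 * h * k2[i];
        derivs(x + 0.5 * h, yt, k3);
        for (i = 0; i < NVAR; i++) yt[i] = y[i] + h * k3[i];
        derivs(x + h, yt, k4);
        for (i = 0; i < NVAR; i++)
            y[i] += h * (k1[i] + 2.0 * k2[i] + 2.0 * k3[i] + k4[i]) / 6.0;
    }
}

/* Start values at x1: y = 0 imposed, slope v1 free. */
void load1(double v1, double y[]) { y[0] = 0.0; y[1] = v1; }

/* Start values at x2: y = 2 imposed, slope v2 free. */
void load2(double v2, double y[]) { y[0] = 2.0; y[1] = v2; }

/* Mismatch of the two shots at xf, equation (17.2.3). */
void fit_mismatch(const double v[], double x1, double x2, double xf, double f[])
{
    double ya[NVAR], yb[NVAR];
    load1(v[0], ya);
    rk4_path(ya, x1, xf, 50);
    load2(v[1], yb);
    rk4_path(yb, x2, xf, 50);
    f[0] = ya[0] - yb[0];
    f[1] = ya[1] - yb[1];
}

/* One Newton cycle: forward-difference Jacobian, then a 2x2 solve by Cramer's rule.
   Returns 0 on success, -1 if the Jacobian is singular. */
int newton_cycle2(double v[], double x1, double x2, double xf)
{
    double f0[2], f1[2], jac[2][2], det;
    int j;
    fit_mismatch(v, x1, x2, xf, f0);
    for (j = 0; j < 2; j++) {
        double saved = v[j];
        double dv = 1.0e-4 * (saved < 0.0 ? -saved : saved);
        if (dv == 0.0) dv = 1.0e-4;
        v[j] = saved + dv;
        fit_mismatch(v, x1, x2, xf, f1);
        v[j] = saved;
        jac[0][j] = (f1[0] - f0[0]) / dv;
        jac[1][j] = (f1[1] - f0[1]) / dv;
    }
    det = jac[0][0] * jac[1][1] - jac[0][1] * jac[1][0];
    if (det == 0.0) return -1;
    v[0] += (-f0[0] * jac[1][1] + f0[1] * jac[0][1]) / det;
    v[1] += (-jac[0][0] * f0[1] + jac[1][0] * f0[0]) / det;
    return 0;
}
```

Starting from v = {0, 0} with x1 = 0, x2 = 1, and xf = 0.5, one cycle should return slopes close to 1 and 4, since the problem is linear. Neither shot has to cross the whole interval, which is the point of the method.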

CITED REFERENCES AND FURTHER READING: 

Acton, F.S. 1970, Numerical Methods That Work; 1990, corrected edition (Washington: Mathematical
Association of America).

Keller, H.B. 1968, Numerical Methods for Two-Point Boundary-Value Problems (Waltham, MA: 
Blaisdell). 

Stoer, J., and Bulirsch, R. 1980, Introduction to Numerical Analysis (New York: Springer-Verlag), 
§§7.3.5-7.3.6. [1] 


17.3 Relaxation Methods 



In relaxation methods we replace ODEs by approximate finite-difference equations 
(FDEs) on a grid or mesh of points that spans the domain of interest. As a typical example, 
we could replace a general first-order differential equation 

$$\frac{dy}{dx} = g(x, y) \tag{17.3.1}$$

with an algebraic equation relating function values at two points $k$, $k - 1$:

$$y_k - y_{k-1} - (x_k - x_{k-1})\, g\left[\tfrac{1}{2}(x_k + x_{k-1}),\ \tfrac{1}{2}(y_k + y_{k-1})\right] = 0 \tag{17.3.2}$$

The form of the FDE in (17.3.2) illustrates the idea, but not uniquely: There are many 
ways to turn the ODE into an FDE. When the problem involves N coupled first-order ODEs 
represented by FDEs on a mesh of M points, a solution consists of values for N dependent 
functions given at each of the M mesh points, or N × M variables in all. The relaxation
method determines the solution by starting with a guess and improving it, iteratively. As the 
iterations improve the solution, the result is said to relax to the true solution. 
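
To see what "satisfying the FDE" means in practice, here is a minimal sketch that evaluates the left-hand side of (17.3.2) for a trial solution on a mesh. The function names, the 0-based arrays, and the test right-hand side $g(x, y) = 2x$ (for which $y = x^2$ satisfies this particular FDE exactly) are assumptions for illustration only.

```c
/* Hypothetical right-hand side g(x, y) for the illustration: y' = 2x. */
double g_example(double x, double y)
{
    (void)y;
    return 2.0 * x;
}

/* Residual of equation (17.3.2) between mesh points k-1 and k. */
double fde_residual(double xkm1, double xk, double ykm1, double yk,
                    double (*g)(double, double))
{
    double h = xk - xkm1;
    return yk - ykm1 - h * g(0.5 * (xk + xkm1), 0.5 * (yk + ykm1));
}

/* Largest |residual| of a trial solution y[0..m-1] on the mesh x[0..m-1]. */
double max_residual(int m, const double x[], const double y[],
                    double (*g)(double, double))
{
    double worst = 0.0;
    int k;
    for (k = 1; k < m; k++) {
        double r = fde_residual(x[k - 1], x[k], y[k - 1], y[k], g);
        if (r < 0.0) r = -r;
        if (r > worst) worst = r;
    }
    return worst;
}
```

A relaxation iteration drives this kind of residual, together with the boundary-condition residuals, toward zero at every mesh point simultaneously.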

While several iteration schemes are possible, for most problems our old standby,
multidimensional Newton’s method, works well. The method produces a matrix equation that
must be solved, but the matrix takes a special, “block diagonal” form, that allows it to be
inverted far more economically both in time and storage than would be possible for a general
matrix of size (MN) × (MN). Since MN can easily be several thousand, this is crucial
for the feasibility of the method.

Our implementation couples at most pairs of points, as in equation 
(17.3.2). More points can be coupled, but then the method becomes more complex. 
We will provide enough background so that you can write a more general scheme if you 
have the patience to do so. 

Let us develop a general set of algebraic equations that represent the ODEs by FDEs. The 
ODE problem is exactly identical to that expressed in equations (17.0.1)-(17.0.3) where we had
$N$ coupled first-order equations that satisfy $n_1$ boundary conditions at $x_1$ and $n_2 = N - n_1$
boundary conditions at $x_2$. We first define a mesh or grid by a set of $k = 1, 2, \ldots, M$ points
at which we supply values for the independent variable $x_k$. In particular, $x_1$ is the initial
boundary, and $x_M$ is the final boundary. We use the notation $\mathbf{y}_k$ to refer to the entire set of
dependent variables $y_1, y_2, \ldots, y_N$ at point $x_k$. At an arbitrary point $k$ in the middle of the
mesh, we approximate the set of $N$ first-order ODEs by algebraic relations of the form

$$0 = \mathbf{E}_k \equiv \mathbf{y}_k - \mathbf{y}_{k-1} - (x_k - x_{k-1})\, \mathbf{g}_k(x_k, x_{k-1}, \mathbf{y}_k, \mathbf{y}_{k-1}), \qquad k = 2, 3, \ldots, M \tag{17.3.3}$$

The notation signifies that $\mathbf{g}_k$ can be evaluated using information from both points $k$, $k - 1$.
The FDEs labeled by $\mathbf{E}_k$ provide $N$ equations coupling $2N$ variables at points $k$, $k - 1$. There
are $M - 1$ points, $k = 2, 3, \ldots, M$, at which difference equations of the form (17.3.3) apply.
Thus the FDEs provide a total of $(M - 1)N$ equations for the $MN$ unknowns. The remaining
$N$ equations come from the boundary conditions.

At the first boundary we have 

$$0 = \mathbf{E}_1 \equiv \mathbf{B}(x_1, \mathbf{y}_1) \tag{17.3.4}$$

while at the second boundary

$$0 = \mathbf{E}_{M+1} \equiv \mathbf{C}(x_M, \mathbf{y}_M) \tag{17.3.5}$$

The vectors $\mathbf{E}_1$ and $\mathbf{B}$ have only $n_1$ nonzero components, corresponding to the $n_1$ boundary
conditions at $x_1$. It will turn out to be useful to take these nonzero components to be the
last $n_1$ components. In other words, $E_{j,1} \ne 0$ only for $j = n_2 + 1, n_2 + 2, \ldots, N$. At
the other boundary, only the first $n_2$ components of $\mathbf{E}_{M+1}$ and $\mathbf{C}$ are nonzero: $E_{j,M+1} \ne 0$
only for $j = 1, 2, \ldots, n_2$.

The “solution” of the FDE problem in (17.3.3)-(17.3.5) consists of a set of variables
$y_{j,k}$, the values of the $N$ variables $y_j$ at the $M$ points $x_k$. The algorithm we describe
below requires an initial guess for the $y_{j,k}$. We then determine increments $\Delta y_{j,k}$ such that
$y_{j,k} + \Delta y_{j,k}$ is an improved approximation to the solution.

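Equations for the increments are developed by expanding the FDEs in first-order Taylor
series with respect to small changes $\Delta\mathbf{y}_k$. At an interior point, $k = 2, 3, \ldots, M$, this gives:

$$\mathbf{E}_k(\mathbf{y}_k + \Delta\mathbf{y}_k,\ \mathbf{y}_{k-1} + \Delta\mathbf{y}_{k-1}) \approx \mathbf{E}_k(\mathbf{y}_k, \mathbf{y}_{k-1}) + \sum_{n=1}^{N} \frac{\partial \mathbf{E}_k}{\partial y_{n,k-1}}\, \Delta y_{n,k-1} + \sum_{n=1}^{N} \frac{\partial \mathbf{E}_k}{\partial y_{n,k}}\, \Delta y_{n,k} \tag{17.3.6}$$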


For a solution we want the updated value $\mathbf{E}(\mathbf{y} + \Delta\mathbf{y})$ to be zero, so the general set of equations
at an interior point can be written in matrix form as

$$\sum_{n=1}^{N} S_{j,n}\, \Delta y_{n,k-1} + \sum_{n=N+1}^{2N} S_{j,n}\, \Delta y_{n-N,k} = -E_{j,k}, \qquad j = 1, 2, \ldots, N \tag{17.3.7}$$

where

$$S_{j,n} = \frac{\partial E_{j,k}}{\partial y_{n,k-1}}, \qquad S_{j,n+N} = \frac{\partial E_{j,k}}{\partial y_{n,k}}, \qquad n = 1, 2, \ldots, N \tag{17.3.8}$$

The quantity $S_{j,n}$ is an $N \times 2N$ matrix at each point $k$. Each interior point thus supplies a
block of $N$ equations coupling $2N$ corrections to the solution variables at the points $k$, $k - 1$.
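
As a worked instance of (17.3.8), suppose the FDE is the midpoint form of (17.3.2) applied componentwise, $E_{j,k} = y_{j,k} - y_{j,k-1} - h_k\, g_j(\bar{x}_k, \bar{\mathbf{y}}_k)$. The shorthand $h_k$, $\bar{x}_k$, $\bar{\mathbf{y}}_k$ and the Kronecker symbol $\delta_{jn}$ are notation introduced only for this example. Differentiating with the chain rule, each midpoint argument contributing a factor of one half, gives

```latex
h_k = x_k - x_{k-1}, \qquad
\bar{x}_k = \tfrac{1}{2}(x_k + x_{k-1}), \qquad
\bar{\mathbf{y}}_k = \tfrac{1}{2}(\mathbf{y}_k + \mathbf{y}_{k-1}),
\\[6pt]
S_{j,n} = \frac{\partial E_{j,k}}{\partial y_{n,k-1}}
        = -\delta_{jn} - \frac{h_k}{2}\,
          \left.\frac{\partial g_j}{\partial y_n}\right|_{(\bar{x}_k,\,\bar{\mathbf{y}}_k)},
\qquad
S_{j,n+N} = \frac{\partial E_{j,k}}{\partial y_{n,k}}
          = \delta_{jn} - \frac{h_k}{2}\,
            \left.\frac{\partial g_j}{\partial y_n}\right|_{(\bar{x}_k,\,\bar{\mathbf{y}}_k)},
\qquad n = 1, \ldots, N .
```

The two halves of the block therefore differ only in the sign of the identity part, which is typical of two-point difference schemes of this kind.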

Similarly, the algebraic relations at the boundaries can be expanded in a first-order 
Taylor series for increments that improve the solution. Since $\mathbf{E}_1$ depends only on $\mathbf{y}_1$, we
find at the first boundary:

$$\sum_{n=1}^{N} S_{j,n}\, \Delta y_{n,1} = -E_{j,1}, \qquad j = n_2 + 1, n_2 + 2, \ldots, N \tag{17.3.9}$$

where

$$S_{j,n} = \frac{\partial E_{j,1}}{\partial y_{n,1}}, \qquad n = 1, 2, \ldots, N \tag{17.3.10}$$

At the second boundary,

$$\sum_{n=1}^{N} S_{j,n}\, \Delta y_{n,M} = -E_{j,M+1}, \qquad j = 1, 2, \ldots, n_2 \tag{17.3.11}$$

where

$$S_{j,n} = \frac{\partial E_{j,M+1}}{\partial y_{n,M}}, \qquad n = 1, 2, \ldots, N \tag{17.3.12}$$

We thus have in equations (17.3.7)-(17.3.12) a set of linear equations to be solved for
the corrections $\Delta\mathbf{y}$, iterating until the corrections are sufficiently small. The equations have
a special structure, because each $S_{j,n}$ couples only points $k$, $k - 1$. Figure 17.3.1 illustrates
the typical structure of the complete matrix equation for the case of 5 variables and 4 mesh
points, with 3 boundary conditions at the first boundary and 2 at the second. The 3 × 5
block of nonzero entries in the top left-hand corner of the matrix comes from the boundary
condition $S_{j,n}$ at point $k = 1$. The next three 5 × 10 blocks are the $S_{j,n}$ at the interior
points, coupling variables at mesh points (2,1), (3,2), and (4,3). Finally we have the block
corresponding to the second boundary condition.

We can solve equations (17.3.7)-(17.3.12) for the increments Ay using a form of 
Gaussian elimination that exploits the special structure of the matrix to minimize the total 
number of operations, and that minimizes storage of matrix coefficients by packing the 
elements in a special blocked structure. (You might wish to review Chapter 2, especially 
§2.2, if you are unfamiliar with the steps involved in Gaussian elimination.) Recall that 
Gaussian elimination consists of manipulating the equations by elementary operations such 
as dividing rows of coefficients by a common factor to produce unity in diagonal elements, 
and adding appropriate multiples of other rows to produce zeros below the diagonal. Here
we take advantage of the block structure by performing a bit more reduction than in pure
Gaussian elimination, so that the storage of coefficients is minimized. Figure 17.3.2 shows
the form that we wish to achieve by elimination, just prior to the backsubstitution step. Only a
small subset of the reduced MN × MN matrix elements needs to be stored as the elimination
progresses. Once the matrix elements reach the stage in Figure 17.3.2, the solution follows 
quickly by a backsubstitution procedure. 

Furthermore, the entire procedure, except the backsubstitution step, operates only on 
one block of the matrix at a time. The procedure contains four types of operations: (1) 
partial reduction to zero of certain elements of a block using results from a previous step, 
(2) elimination of the square structure of the remaining block elements such that the square 
section contains unity along the diagonal, and zero in off-diagonal elements, (3) storage of the 
remaining nonzero coefficients for use in later steps, and (4) backsubstitution. We illustrate 
the steps schematically by figures. 

Consider the block of equations describing corrections available from the initial boundary 
conditions. We have n_1 equations for N unknown corrections. We wish to transform the first 
block so that its left-hand n_1 × n_1 square section becomes unity along the diagonal, and zero 
in off-diagonal elements. Figure 17.3.3 shows the original and final form of the first block 
of the matrix. In the figure we designate matrix elements that are subject to diagonalization 
by “D”, and elements that will be altered by “A”; in the final block, elements that are stored 
are labeled by “S”. We get from start to finish by selecting in turn n_1 “pivot” elements from 
among the first n_1 columns, normalizing the pivot row so that the value of the “pivot” element 
is unity, and adding appropriate multiples of this row to the remaining rows so that they 
contain zeros in the pivot column. In its final form, the reduced block expresses values for the 
corrections to the first n_1 variables at mesh point 1 in terms of values for the remaining n_2 
unknown corrections at point 1, i.e., we now know what the first n_1 elements are in terms of 
the remaining n_2 elements. We store only the final set of n_2 nonzero columns from the initial 
block, plus the column for the altered right-hand side of the matrix equation. 
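
The reduction of the first block can be sketched in a few lines of C. This is an illustration only, not the book's pinvs routine: the name reduce_first_block, the zero-based row-major storage, and the use of explicit row swaps (pinvs instead records the pivot ordering) are all assumptions made for the sketch. It selects a pivot in each of the first n1 columns, normalizes the pivot row, and clears the pivot column from the other rows.

```c
#include <math.h>

/* Hypothetical helper (not from the book): diagonalize the left n1 x n1
   section of an n1 x ncol block stored row-major in a[], zero-based.
   Returns 0 on success, -1 if some pivot column is entirely zero. */
int reduce_first_block(double *a, int n1, int ncol)
{
    int col, row, j, piv;
    double big, tmp, inv, f;

    for (col = 0; col < n1; col++) {
        /* Choose the largest remaining entry in this column as the pivot. */
        piv = col;
        big = fabs(a[col*ncol + col]);
        for (row = col + 1; row < n1; row++) {
            if (fabs(a[row*ncol + col]) > big) {
                big = fabs(a[row*ncol + col]);
                piv = row;
            }
        }
        if (big == 0.0) return -1;      /* square subsection appears singular */
        /* Swap the pivot row into position. */
        if (piv != col) {
            for (j = 0; j < ncol; j++) {
                tmp = a[col*ncol + j];
                a[col*ncol + j] = a[piv*ncol + j];
                a[piv*ncol + j] = tmp;
            }
        }
        /* Normalize the pivot row so that the pivot element is unity. */
        inv = 1.0 / a[col*ncol + col];
        for (j = 0; j < ncol; j++) a[col*ncol + j] *= inv;
        /* Subtract multiples of the pivot row to zero the pivot column. */
        for (row = 0; row < n1; row++) {
            if (row == col) continue;
            f = a[row*ncol + col];
            for (j = 0; j < ncol; j++) a[row*ncol + j] -= f * a[col*ncol + j];
        }
    }
    return 0;
}
```

After the call, the columns to the right of the identity are the ones that would be stored (the “S” elements of Figure 17.3.3), the last of them being the altered right-hand side.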

We must emphasize here an important detail of the method. To exploit the reduced 
storage allowed by operating on blocks, it is essential that the ordering of columns in the 
s matrix of derivatives be such that pivot elements can be found among the first n_1 rows 
of the matrix. This means that the n_1 boundary conditions at the first point must contain 
some dependence on the first j=1,2,...,n_1 dependent variables, y[j][1]. If not, then the 
original square n_1 × n_1 subsection of the first block will appear to be singular, and the method 
will fail. Alternatively, we would have to allow the search for pivot elements to involve all 
N columns of the block, and this would require column swapping and far more bookkeeping. 
The code provides a simple method of reordering the variables, i.e., the columns of the s 
matrix, so that this can be done easily. End of important detail. 

Figure 17.3.1. Matrix structure of a set of linear finite-difference equations (FDEs) with boundary 
conditions imposed at both endpoints. Here X represents a coefficient of the FDEs, V represents a 
component of the unknown solution vector, and B is a component of the known right-hand side. Empty 
spaces represent zeros. The matrix equation is to be solved by a special form of Gaussian elimination. 
(See text for details.) 

Figure 17.3.2. Target structure of the Gaussian elimination. Once the matrix of Figure 17.3.1 has been 
reduced to this form, the solution follows quickly by backsubstitution. 

(a)  D D D A A     V     A 
     D D D A A     V     A 
     D D D A A     V     A 

(b)  1 0 0 S S     V     S 
     0 1 0 S S     V     S 
     0 0 1 S S     V     S 


Figure 17.3.3. Reduction process for the first (upper left) block of the matrix in Figure 17.3.1. (a) 
Original form of the block, (b) final form. (See text for explanation.) 


(a)  1 0 0 S S               V     S 
     0 1 0 S S               V     S 
     0 0 1 S S               V     S 
     Z Z Z D D D D D A A     V     A 
     Z Z Z D D D D D A A     V     A 
     Z Z Z D D D D D A A     V     A 
     Z Z Z D D D D D A A     V     A 
     Z Z Z D D D D D A A     V     A 

(b)  1 0 0 S S               V     S 
     0 1 0 S S               V     S 
     0 0 1 S S               V     S 
     0 0 0 1 0 0 0 0 S S     V     S 
     0 0 0 0 1 0 0 0 S S     V     S 
     0 0 0 0 0 1 0 0 S S     V     S 
     0 0 0 0 0 0 1 0 S S     V     S 
     0 0 0 0 0 0 0 1 S S     V     S 


Figure 17.3.4. Reduction process for intermediate blocks of the matrix in Figure 17.3.1. (a) Original 
form, (b) final form. (See text for explanation.) 



(a)  0 0 0 1 0 0 0 0 S S     V     S 
     0 0 0 0 1 0 0 0 S S     V     S 
     0 0 0 0 0 1 0 0 S S     V     S 
     0 0 0 0 0 0 1 0 S S     V     S 
     0 0 0 0 0 0 0 1 S S     V     S 
               Z Z Z D D     V     A 
               Z Z Z D D     V     A 

(b)  0 0 0 1 0 0 0 0 S S     V     S 
     0 0 0 0 1 0 0 0 S S     V     S 
     0 0 0 0 0 1 0 0 S S     V     S 
     0 0 0 0 0 0 1 0 S S     V     S 
     0 0 0 0 0 0 0 1 S S     V     S 
               0 0 0 1 0     V     S 
               0 0 0 0 1     V     S 

Figure 17.3.5. Reduction process for the last (lower right) block of the matrix in Figure 17.3.1. (a) 
Original form, (b) final form. (See text for explanation.) 

Next consider the block of N equations representing the FDEs that describe the relation 
between the 2N corrections at points 2 and 1. The elements of that block, together with results 
from the previous step, are illustrated in Figure 17.3.4. Note that by adding suitable multiples 
of rows from the first block we can reduce to zero the first n_1 columns of the block (labeled 
by “Z”), and, to do so, we will need to alter only the columns from n_1 + 1 to N and the 
vector element on the right-hand side. Of the remaining columns we can diagonalize a square 
subsection of N × N elements, labeled by “D” in the figure. In the process we alter the final 
set of n_2 + 1 columns, denoted “A” in the figure. The second half of the figure shows the 
block when we finish operating on it, with the stored (n_2 + 1) × N elements labeled by “S.” 

If we operate on the next set of equations corresponding to the FDEs coupling corrections 
at points 3 and 2, we see that the state of available results and new equations exactly reproduces 
the situation described in the previous paragraph. Thus, we can carry out those steps again 
for each block in turn through block M. Finally on block M + 1 we encounter the remaining 
boundary conditions. 

Figure 17.3.5 shows the final block of n_2 FDEs relating the N corrections for variables 
at mesh point M, together with the result of reducing the previous block. Again, we can first 
use the prior results to zero the first n_1 columns of the block. Now, when we diagonalize 
the remaining square section, we strike gold: We get values for the final n_2 corrections 
at mesh point M. 

With the final block reduced, the matrix has the desired form shown previously in 
Figure 17.3.2, and the matrix is ripe for backsubstitution. Starting with the bottom row and 
working up towards the top, at each stage we can simply determine one unknown correction 
in terms of known quantities. 
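
The idea of working upward from the bottom row can be sketched for a small dense system. The helper below is hypothetical (its name, the zero-based row-major storage, and the dense layout are assumptions for illustration); the book's bksub works on the compact block storage c instead. It assumes the matrix already has ones on the diagonal and zeros below it, as in Figure 17.3.2.

```c
/* Hypothetical helper: solve U x = b in place, where U is n x n, stored
   row-major, with ones on the diagonal and zeros below it. On return
   b holds the solution x. */
void back_substitute_unit(const double *u, double *b, int n)
{
    int i, j;

    /* The last row already reads x[n-1] = b[n-1]; work upward from there. */
    for (i = n - 2; i >= 0; i--)
        for (j = i + 1; j < n; j++)
            b[i] -= u[i*n + j] * b[j];   /* remove the already-known unknowns */
}
```

Each pass through the outer loop determines exactly one unknown from quantities that are already known.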

The function solvde organizes the steps described above. The principal procedures 
used in the algorithm are performed by functions called internally by solvde. The function 
red eliminates leading columns of the s matrix using results from prior blocks, pinvs 
diagonalizes the square subsection of s and stores unreduced coefficients, bksub carries 
out the backsubstitution step. The user of solvde must understand the calling arguments, 
as described below, and supply a function difeq, called by solvde, that evaluates the s 
matrix for each block. 

Most of the arguments in the call to solvde have already been described, but some 
require discussion. Array y[j][k] contains the initial guess for the solution, with j labeling 
the dependent variables at mesh points k. The problem involves ne FDEs spanning points 
k=1,...,m. There are nb boundary conditions at the first point k=1. The array indexv[j] 
establishes the correspondence between columns of the s matrix, equations (17.3.8), (17.3.10), 
and (17.3.12), and the dependent variables. As described above it is essential that the nb 
boundary conditions at k=1 involve the dependent variables referenced by the first nb columns 
of the s matrix. Thus, columns j of the s matrix can be ordered by the user in difeq to refer 
to derivatives with respect to the dependent variable indexv[j]. 

The function only attempts itmax correction cycles before returning, even if the solution 
has not converged. The parameters conv, slowc, scalv relate to convergence. Each 
inversion of the matrix produces corrections for ne variables at m mesh points. We want these 
to become vanishingly small as the iterations proceed, but we must define a measure for the 
size of corrections. This error “norm” is very problem specific, so the user might wish to 
rewrite this section of the code as appropriate. In the program below we compute a value 
for the average correction err by summing the absolute value of all corrections, weighted by 
a scale factor appropriate to each type of variable: 



$$ \mathrm{err} = \frac{1}{m \times ne} \sum_{k=1}^{m} \sum_{j=1}^{ne} \frac{|\Delta y[j][k]|}{\mathrm{scalv}[j]} \tag{17.3.13} $$


When err<conv, the method has converged. Note that the user gets to supply an array scalv 
which measures the typical size of each variable. 
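
As a minimal sketch of equation (17.3.13), the average scaled correction can be computed as follows. The function name average_correction and the flat zero-based array layout are assumptions made for this illustration; solvde itself reads the corrections out of its workspace array c.

```c
#include <math.h>

/* Hypothetical helper: average scaled correction, equation (17.3.13).
   dy is stored row-major and zero-based: dy[j*m + k] is the correction
   for variable j at mesh point k. scalv[j] is the typical size of
   variable j. */
double average_correction(const double *dy, const double *scalv, int ne, int m)
{
    int j, k;
    double err = 0.0;

    for (j = 0; j < ne; j++) {
        double errj = 0.0;
        for (k = 0; k < m; k++) errj += fabs(dy[j*m + k]);
        err += errj / scalv[j];          /* weight by the variable's scale */
    }
    return err / ((double)m * ne);       /* average over all m*ne corrections */
}
```

With two variables at two mesh points, corrections {1, -1} and {4, 0}, and scales 1 and 2, the sum is 2/1 + 4/2 = 4 and the average is 4/4 = 1.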

Obviously, if err is large, we are far from a solution, and perhaps it is a bad idea 
to believe that the corrections generated from a first-order Taylor series are accurate. The 
number slowc modulates application of corrections. After each iteration we apply only a 
fraction of the corrections found by matrix inversion: 

$$ y[j][k] \rightarrow y[j][k] + \frac{\mathrm{slowc}}{\max(\mathrm{slowc}, \mathrm{err})}\, \Delta y[j][k] \tag{17.3.14} $$

Thus, when err>slowc only a fraction of the corrections are used, but when err<slowc 
the entire correction gets applied. 
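
The damping rule of equation (17.3.14) reduces to a one-line function. The name correction_fraction is an assumption for this sketch; it returns the same value that solvde calls fac.

```c
/* Hypothetical helper: fraction of the correction actually applied,
   following equation (17.3.14): slowc / max(slowc, err). */
double correction_fraction(double slowc, double err)
{
    return slowc / (err > slowc ? err : slowc);
}
```

For slowc = 1 an average error of 4 gives a fraction of 0.25, while an average error of 0.5 gives 1, so the full correction is applied.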

The call statement also supplies solvde with the array y[1..nyj][1..nyk] containing 
the initial trial solution, and workspace arrays c[1..ne][1..ne-nb+1][1..m+1], 
s[1..ne][1..2*ne+1]. The array c is the blockbuster: It stores the unreduced elements 
of the matrix built up for the backsubstitution step. If there are m mesh points, then there 
will be m+1 blocks, each requiring ne rows and ne-nb+1 columns. Although large, this 
is small compared with the (ne × m)^2 elements required for the whole matrix if we did not 
break it into blocks. 
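
To make the saving concrete, the two element counts can be compared directly. The helper names below are assumptions for this sketch; the formulas are the ones stated in the paragraph above.

```c
/* Hypothetical helpers: element counts for the block workspace c and for
   the full (ne*m) x (ne*m) matrix that it replaces. */
long block_storage(long ne, long nb, long m)
{
    return ne * (ne - nb + 1) * (m + 1);   /* m+1 blocks of ne x (ne-nb+1) */
}

long full_storage(long ne, long m)
{
    return (ne * m) * (ne * m);
}
```

For ne = 3, nb = 1 and m = 41 (the sizes used in the worked example of the next section) the block workspace needs 378 elements, against 15129 for the full matrix.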

We now describe the workings of the user-supplied function difeq. The synopsis of 
the function is 

void difeq(int k, int k1, int k2, int jsf, int is1, int isf, 
    int indexv[], int ne, float **s, float **y); 

The only information passed from difeq to solvde is the matrix of derivatives 
s[1..ne][1..2*ne+1]; all other arguments are input to difeq and should not be altered. 
k indicates the current mesh point, or block number. k1, k2 label the first and last point in 
the mesh. If k=k1 or k>k2, the block involves the boundary conditions at the first or final 
points; otherwise the block acts on FDEs coupling variables at points k-1, k. 

The convention on storing information into the array s[i][j] follows that used in 
equations (17.3.8), (17.3.10), and (17.3.12): Rows i label equations, columns j refer to 
derivatives with respect to dependent variables in the solution. Recall that each equation will 
depend on the ne dependent variables at either one or two points. Thus, j runs from 1 to 
either ne or 2*ne. The column ordering for dependent variables at each point must agree 
with the list supplied in indexv[j]. Thus, for a block not at a boundary, the first column 
multiplies ΔY(l=indexv[1],k-1), and the column ne+1 multiplies ΔY(l=indexv[1],k). 
is1, isf give the numbers of the starting and final rows that need to be filled in the s matrix 
for this block. jsf labels the column in which the difference equations E_{j,k} of equations 
(17.3.3)-(17.3.5) are stored. Thus, -s[i][jsf] is the vector on the right-hand side of the 
matrix. The reason for the minus sign is that difeq supplies the actual difference equation, 
E_{j,k}, not its negative. Note that solvde supplies a value for jsf such that the difference 
equation is put in the column just after all derivatives in the s matrix. Thus, solvde expects to 
find values entered into s[i][j] for rows is1 ≤ i ≤ isf and 1 ≤ j ≤ jsf. 

Finally, s[1..nsi][1..nsj] and y[1..nyj][1..nyk] supply difeq with storage 
for s and the solution variables y for this iteration. An example of how to use this routine 
is given in the next section. 

Many ideas in the following code are due to Eggleton [1]. 

#include <stdio.h> 
#include <math.h> 
#include "nrutil.h" 

void solvde(int itmax, float conv, float slowc, float scalv[], int indexv[], 
    int ne, int nb, int m, float **y, float ***c, float **s) 
/* Driver routine for solution of two point boundary value problems by relaxation. itmax is the 
maximum number of iterations. conv is the convergence criterion (see text). slowc controls 
the fraction of corrections actually used after each iteration. scalv[1..ne] contains typical 
sizes for each dependent variable, used to weight errors. indexv[1..ne] lists the column 
ordering of variables used to construct the matrix s[1..ne][1..2*ne+1] of derivatives. (The 
nb boundary conditions at the first mesh point must contain some dependence on the first nb 
variables listed in indexv.) The problem involves ne equations for ne adjustable dependent 
variables at each point. At the first mesh point there are nb boundary conditions. There are a 
total of m mesh points. y[1..ne][1..m] is the two-dimensional array that contains the initial 
guess for all the dependent variables at each mesh point. On each iteration, it is updated by the 
calculated correction. The arrays c[1..ne][1..ne-nb+1][1..m+1] and s supply dummy 
storage used by the relaxation code. */ 

{ 
    void bksub(int ne, int nb, int jf, int k1, int k2, float ***c); 
    void difeq(int k, int k1, int k2, int jsf, int is1, int isf, 
        int indexv[], int ne, float **s, float **y); 
    void pinvs(int ie1, int ie2, int je1, int jsf, int jc1, int k, 
        float ***c, float **s); 
    void red(int iz1, int iz2, int jz1, int jz2, int jm1, int jm2, int jmf, 
        int ic1, int jc1, int jcf, int kc, float ***c, float **s); 
    int ic1,ic2,ic3,ic4,it,j,j1,j2,j3,j4,j5,j6,j7,j8,j9; 
    int jc1,jcf,jv,k,k1,k2,km,kp,nvars,*kmax; 
    float err,errj,fac,vmax,vz,*ermax; 

    kmax=ivector(1,ne); 
    ermax=vector(1,ne); 
    k1=1;                               /* Set up row and column markers. */ 
    k2=m; 
    nvars=ne*m; 
    j1=1; 
    j2=nb; 
    j3=nb+1; 
    j4=ne; 
    j5=j4+j1; 
    j6=j4+j2; 
    j7=j4+j3; 
    j8=j4+j4; 
    j9=j8+j1; 
    ic1=1; 
    ic2=ne-nb; 
    ic3=ic2+1; 
    ic4=ne; 
    jc1=1; 
    jcf=ic3; 
    for (it=1;it<=itmax;it++) {         /* Primary iteration loop. */ 
        k=k1;                           /* Boundary conditions at first point. */ 
        difeq(k,k1,k2,j9,ic3,ic4,indexv,ne,s,y); 
        pinvs(ic3,ic4,j5,j9,jc1,k1,c,s); 
        for (k=k1+1;k<=k2;k++) {        /* Finite difference equations at all point pairs. */ 
            kp=k-1; 
            difeq(k,k1,k2,j9,ic1,ic4,indexv,ne,s,y); 
            red(ic1,ic4,j1,j2,j3,j4,j9,ic3,jc1,jcf,kp,c,s); 
            pinvs(ic1,ic4,j3,j9,jc1,k,c,s); 
        } 
        k=k2+1;                         /* Final boundary conditions. */ 
        difeq(k,k1,k2,j9,ic1,ic2,indexv,ne,s,y); 
        red(ic1,ic2,j5,j6,j7,j8,j9,ic3,jc1,jcf,k2,c,s); 
        pinvs(ic1,ic2,j7,j9,jcf,k2+1,c,s); 
        bksub(ne,nb,jcf,k1,k2,c);       /* Backsubstitution. */ 
        err=0.0; 
        for (j=1;j<=ne;j++) {           /* Convergence check, accumulate average error. */ 
            jv=indexv[j]; 
            errj=vmax=0.0; 
            km=0; 
            for (k=k1;k<=k2;k++) {      /* Find point with largest error, for each dependent variable. */ 
                vz=fabs(c[jv][1][k]); 
                if (vz > vmax) { 
                    vmax=vz; 
                    km=k; 
                } 
                errj += vz; 
            } 
            err += errj/scalv[j];       /* Note weighting for each dependent variable. */ 
            ermax[j]=c[jv][1][km]/scalv[j]; 
            kmax[j]=km; 
        } 
        err /= nvars; 
        fac=(err > slowc ? slowc/err : 1.0);    /* Reduce correction applied when error is large. */ 
        for (j=1;j<=ne;j++) {           /* Apply corrections. */ 
            jv=indexv[j]; 
            for (k=k1;k<=k2;k++) 
                y[j][k] -= fac*c[jv][1][k]; 
        } 
        printf("\n%8s %9s %9s\n","Iter.","Error","FAC");   /* Summary of corrections for this step. */ 
        printf("%6d %12.6f %11.6f\n",it,err,fac); 
        if (err < conv) {               /* Point with largest error for each variable can */ 
            free_vector(ermax,1,ne);    /* be monitored by writing out kmax and ermax. */ 
            free_ivector(kmax,1,ne); 
            return; 
        } 
    } 
    nrerror("Too many iterations in solvde");   /* Convergence failed. */ 
} 


void bksub(int ne, int nb, int jf, int k1, int k2, float ***c) 
/* Backsubstitution, used internally by solvde. */ 
{ 
    int nbf,im,kp,k,j,i; 
    float xx; 

    nbf=ne-nb; 
    im=1; 
    for (k=k2;k>=k1;k--) {      /* Use recurrence relations to eliminate remaining dependences. */ 
        if (k == k1) im=nbf+1; 
        kp=k+1; 
        for (j=1;j<=nbf;j++) { 
            xx=c[j][jf][kp]; 
            for (i=im;i<=ne;i++) 
                c[i][jf][k] -= c[i][j][k]*xx; 
        } 
    } 
    for (k=k1;k<=k2;k++) {      /* Reorder corrections to be in column 1. */ 
        kp=k+1; 
        for (i=1;i<=nb;i++) c[i][1][k]=c[i+nbf][jf][k]; 
        for (i=1;i<=nbf;i++) c[i+nb][1][k]=c[i][jf][kp]; 
    } 
} 


#include <math.h> 
#include "nrutil.h" 

void pinvs(int ie1, int ie2, int je1, int jsf, int jc1, int k, float ***c, 
    float **s) 
/* Diagonalize the square subsection of the s matrix, and store the recursion coefficients in c; 
used internally by solvde. */ 
{ 
    int js1,jpiv,jp,je2,jcoff,j,irow,ipiv,id,icoff,i,*indxr; 
    float pivinv,piv,dum,big,*pscl; 

    indxr=ivector(ie1,ie2); 
    pscl=vector(ie1,ie2); 
    je2=je1+ie2-ie1; 
    js1=je2+1; 
    for (i=ie1;i<=ie2;i++) {            /* Implicit pivoting, as in §2.1. */ 
        big=0.0; 
        for (j=je1;j<=je2;j++) 
            if (fabs(s[i][j]) > big) big=fabs(s[i][j]); 
        if (big == 0.0) nrerror("Singular matrix - row all 0, in pinvs"); 
        pscl[i]=1.0/big; 
        indxr[i]=0; 
    } 
    for (id=ie1;id<=ie2;id++) { 
        piv=0.0; 
        for (i=ie1;i<=ie2;i++) {        /* Find pivot element. */ 
            if (indxr[i] == 0) { 
                big=0.0; 
                for (j=je1;j<=je2;j++) { 
                    if (fabs(s[i][j]) > big) { 
                        jp=j; 
                        big=fabs(s[i][j]); 
                    } 
                } 
                if (big*pscl[i] > piv) { 
                    ipiv=i; 
                    jpiv=jp; 
                    piv=big*pscl[i]; 
                } 
            } 
        } 
        if (s[ipiv][jpiv] == 0.0) nrerror("Singular matrix in routine pinvs"); 
        indxr[ipiv]=jpiv;               /* In place reduction. Save column ordering. */ 
        pivinv=1.0/s[ipiv][jpiv]; 
        for (j=je1;j<=jsf;j++) s[ipiv][j] *= pivinv;    /* Normalize pivot row. */ 
        s[ipiv][jpiv]=1.0; 
        for (i=ie1;i<=ie2;i++) {        /* Reduce nonpivot elements in column. */ 
            if (indxr[i] != jpiv) { 
                if (s[i][jpiv]) { 
                    dum=s[i][jpiv]; 
                    for (j=je1;j<=jsf;j++) 
                        s[i][j] -= dum*s[ipiv][j]; 
                    s[i][jpiv]=0.0; 
                } 
            } 
        } 
    } 
    jcoff=jc1-js1;                      /* Sort and store unreduced coefficients. */ 
    icoff=ie1-je1; 
    for (i=ie1;i<=ie2;i++) { 
        irow=indxr[i]+icoff; 
        for (j=js1;j<=jsf;j++) c[irow][j+jcoff][k]=s[i][j]; 
    } 
    free_vector(pscl,ie1,ie2); 
    free_ivector(indxr,ie1,ie2); 
} 


void red(int iz1, int iz2, int jz1, int jz2, int jm1, int jm2, int jmf, 
    int ic1, int jc1, int jcf, int kc, float ***c, float **s) 
/* Reduce columns jz1-jz2 of the s matrix, using previous results as stored in the c matrix. Only 
columns jm1-jm2, jmf are affected by the prior results. red is used internally by solvde. */ 
{ 
    int loff,l,j,ic,i; 
    float vx; 

    loff=jc1-jm1; 
    ic=ic1; 
    for (j=jz1;j<=jz2;j++) {            /* Loop over columns to be zeroed. */ 
        for (l=jm1;l<=jm2;l++) {        /* Loop over columns altered. */ 
            vx=c[ic][l+loff][kc]; 
            for (i=iz1;i<=iz2;i++) s[i][l] -= s[i][j]*vx;   /* Loop over rows. */ 
        } 
        vx=c[ic][jcf][kc]; 
        for (i=iz1;i<=iz2;i++) s[i][jmf] -= s[i][j]*vx;     /* Plus final element. */ 
        ic += 1; 
    } 
} 


“Algebraically Difficult” Sets of Differential Equations 

Relaxation methods allow you to take advantage of an additional opportunity that, while 
not obvious, can speed up some calculations enormously. It is not necessary that the set 
of variables y_{j,k} correspond exactly with the dependent variables of the original differential 
equations. They can be related to those variables through algebraic equations. Obviously, it 
is necessary only that the solution variables allow us to evaluate the functions y, g, B, C that 
are used to construct the FDEs from the ODEs. In some problems g depends on functions of 
y that are known only implicitly, so that iterative solutions are necessary to evaluate functions 
in the ODEs. Often one can dispense with this “internal” nonlinear problem by defining 
a new set of variables from which both y, g and the boundary conditions can be obtained 
directly. A typical example occurs in physical problems where the equations require solution 
of a complex equation of state that can be expressed in more convenient terms using variables 
other than the original dependent variables in the ODE. While this approach is analogous to 
performing an analytic change of variables directly on the original ODEs, such an analytic 
transformation might be prohibitively complicated. The change of variables in the relaxation 
method is easy and requires no analytic manipulations. 

CITED REFERENCES AND FURTHER READING: 

Eggleton, P.P. 1971, Monthly Notices of the Royal Astronomical Society, vol. 151, pp. 351-364. [1] 
Keller, H.B. 1968, Numerical Methods for Two-Point Boundary-Value Problems (Waltham, MA: 
Blaisdell). 

Kippenhan, R., Weigert, A., and Hofmeister, E. 1968, in Methods in Computational Physics, 
vol. 7 (New York: Academic Press), pp. 129ff. 


17.4 A Worked Example: Spheroidal Harmonics 


The best way to understand the algorithms of the previous sections is to see 
them employed to solve an actual problem. As a sample problem, we have selected 
the computation of spheroidal harmonics. (The more common name is spheroidal 
angle functions, but we prefer the explicit reminder of the kinship with spherical 
harmonics.) We will show how to find spheroidal harmonics, first by the method 
of relaxation (§17.3), and then by the methods of shooting (§17.1) and shooting 
to a fitting point (§17.2). 

Spheroidal harmonics typically arise when certain partial differential 
equations are solved by separation of variables in spheroidal coordinates. They 
satisfy the following differential equation on the interval -1 < x < 1: 

$$ \frac{d}{dx}\left[(1 - x^2)\frac{dS}{dx}\right] + \left(\lambda - c^2 x^2 - \frac{m^2}{1 - x^2}\right) S = 0 \tag{17.4.1} $$

Here m is an integer, c is the “oblateness parameter,” and λ is the eigenvalue. Despite 
the notation, c^2 can be positive or negative. For c^2 > 0 the functions are called 
“prolate,” while if c^2 < 0 they are called “oblate.” The equation has singular points 
at x = ±1 and is to be solved subject to the boundary conditions that the solution be 
regular at x = ±1. Only for certain values of λ, the eigenvalues, will this be possible. 

If we consider first the spherical case, where c = 0, we recognize the differential 
equation for Legendre functions P_n^m(x). In this case the eigenvalues are λ_mn = 
n(n + 1), n = m, m + 1, .... The integer n labels successive eigenvalues for 
fixed m: When n = m we have the lowest eigenvalue, and the corresponding 
eigenfunction has no nodes in the interval -1 < x < 1; when n = m + 1 we have 
the next eigenvalue, and the eigenfunction has one node inside (-1,1); and so on. 

A similar situation holds for the general case c^2 ≠ 0. We write the eigenvalues 
of (17.4.1) as λ_mn(c) and the eigenfunctions as S_mn(x;c). For fixed m, n = 
m, m + 1, ... labels the successive eigenvalues. 

The computation of λ_mn(c) and S_mn(x;c) traditionally has been quite difficult. 
Complicated recurrence relations, power series expansions, etc., can be found 
in references [1-3]. Cheap computing makes evaluation by direct solution of the 
differential equation quite feasible. 

The first step is to investigate the behavior of the solution near the singular 
points x = ±1. Substituting a power series expansion of the form 

$$ S = (1 \pm x)^{\alpha} \sum_{k=0}^{\infty} a_k (1 \pm x)^k \tag{17.4.2} $$

in equation (17.4.1), we find that the regular solution has α = m/2. (Without loss 
of generality we can take m ≥ 0 since m → -m is a symmetry of the equation.) 
We get an equation that is numerically more tractable if we factor out this behavior. 
Accordingly we set 

$$ S = (1 - x^2)^{m/2} y \tag{17.4.3} $$

We then find from (17.4.1) that y satisfies the equation 

$$ (1 - x^2)\frac{d^2 y}{dx^2} - 2(m + 1)x\frac{dy}{dx} + (\mu - c^2 x^2) y = 0 \tag{17.4.4} $$

where 

$$ \mu \equiv \lambda - m(m + 1) \tag{17.4.5} $$

Both equations (17.4.1) and (17.4.4) are invariant under the replacement 
x → -x. Thus the functions S and y must also be invariant, except possibly for an 
overall scale factor. (Since the equations are linear, a constant multiple of a solution 
is also a solution.) Because the solutions will be normalized, the scale factor can 
only be ±1. If n - m is odd, there are an odd number of zeros in the interval (-1,1). 
Thus we must choose the antisymmetric solution y(-x) = -y(x) which has a zero 
at x = 0. Conversely, if n - m is even we must have the symmetric solution. Thus 

$$ y_{mn}(-x) = (-1)^{n-m} y_{mn}(x) \tag{17.4.6} $$

and similarly for S_mn. 

The boundary conditions on (17.4.4) require that y be regular at x = ±1. In 
other words, near the endpoints the solution takes the form 

$$ y = a_0 + a_1(1 - x^2) + a_2(1 - x^2)^2 + \ldots \tag{17.4.7} $$

Substituting this expansion in equation (17.4.4) and letting x → 1, we find that 

$$ a_1 = -\frac{\mu - c^2}{4(m + 1)}\, a_0 \tag{17.4.8} $$

Equivalently, 

$$ y'(1) = \frac{\mu - c^2}{2(m + 1)}\, y(1) \tag{17.4.9} $$

A similar equation holds at x = -1 with a minus sign on the right-hand side. 
The irregular solution has a different relation between function and derivative at 
the endpoints. 

Instead of integrating the equation from -1 to 1, we can exploit the symmetry 
(17.4.6) to integrate from 0 to 1. The boundary condition at x = 0 is 

$$ \begin{aligned} y(0) &= 0, & n - m \text{ odd} \\ y'(0) &= 0, & n - m \text{ even} \end{aligned} \tag{17.4.10} $$

A third boundary condition comes from the fact that any constant multiple 
of a solution y is a solution. We can thus normalize the solution. We adopt the 
normalization that the function S_mn has the same limiting behavior as P_n^m at x = 1: 

$$ \lim_{x \to 1} (1 - x^2)^{-m/2} S_{mn}(x;c) = \lim_{x \to 1} (1 - x^2)^{-m/2} P_n^m(x) \tag{17.4.11} $$

Various normalization conventions in the literature are tabulated by Flammer [1]. 

Imposing three boundary conditions for the second-order equation (17.4.4) 
turns it into an eigenvalue problem for λ or equivalently for μ. We write it in the 
standard form by setting 

$$ y_1 = y \tag{17.4.12} $$
$$ y_2 = y' \tag{17.4.13} $$
$$ y_3 = \mu \tag{17.4.14} $$

Then 

$$ y_1' = y_2 \tag{17.4.15} $$
$$ y_2' = \frac{1}{1 - x^2}\left[2x(m + 1) y_2 - (y_3 - c^2 x^2) y_1\right] \tag{17.4.16} $$
$$ y_3' = 0 \tag{17.4.17} $$

The boundary condition at x = 0 in this notation is 

$$ \begin{aligned} y_1 &= 0, & n - m \text{ odd} \\ y_2 &= 0, & n - m \text{ even} \end{aligned} \tag{17.4.18} $$

At x = 1 we have two conditions: 

$$ y_2 = \frac{y_3 - c^2}{2(m + 1)}\, y_1 \tag{17.4.19} $$

$$ y_1 = \lim_{x \to 1} (1 - x^2)^{-m/2} P_n^m(x) = \frac{(-1)^m (n + m)!}{2^m\, m!\, (n - m)!} \equiv \gamma \tag{17.4.20} $$
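
The normalization constant γ of equation (17.4.20) can be evaluated either from the factorial formula or as a running product that avoids large factorials. The sketch below shows both; the function names are assumptions made for this illustration, and integers 0 ≤ m ≤ n are assumed.

```c
#include <math.h>

/* Hypothetical helper: gamma as a running product. Step i multiplies by
   -(n+i)(n-i+1)/(2i), which builds up (-1)^m (n+m)! / (2^m m! (n-m)!). */
double gamma_by_product(int m, int n)
{
    double g = 1.0, q = n;
    int i;

    for (i = 1; i <= m; i++) {
        g = -0.5 * g * (n + i) * (q / i);
        q -= 1.0;
    }
    return g;
}

/* k! as a double; adequate for the small arguments used here. */
static double factorial(int k)
{
    double f = 1.0;
    while (k > 1) f *= k--;
    return f;
}

/* Hypothetical helper: gamma from the closed form in equation (17.4.20). */
double gamma_by_factorials(int m, int n)
{
    double sign = (m % 2) ? -1.0 : 1.0;
    return sign * factorial(n + m)
           / (ldexp(1.0, m) * factorial(m) * factorial(n - m));
}
```

For m = n = 2 both routes give 4!/(4 · 2! · 0!) = 3, and for m = 1, n = 3 both give -6.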

We are now ready to illustrate the use of the methods of previous sections 
on this problem. 


Relaxation 


If we just want a few isolated values of λ or S, shooting is probably the quickest 
method. However, if we want values for a large sequence of values of c, relaxation 
is better. Relaxation rewards a good initial guess with rapid convergence, and the 
previous solution should be a good initial guess if c is changed only slightly. 

For simplicity, we choose a uniform grid on the interval 0 ≤ x ≤ 1. For a 
total of M mesh points, we have 

$$ h = \frac{1}{M - 1} \tag{17.4.21} $$
$$ x_k = (k - 1)h, \quad k = 1, 2, \ldots, M \tag{17.4.22} $$
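
A minimal sketch of the mesh of equations (17.4.21) and (17.4.22) follows. The function name uniform_mesh is an assumption; the unit-offset indexing x[1..M] follows the convention used throughout this book.

```c
/* Hypothetical helper: fill x[1..M] with the uniform mesh of equations
   (17.4.21)-(17.4.22); x[0] is unused. Returns the spacing h. */
double uniform_mesh(double *x, int M)
{
    double h = 1.0 / (M - 1);
    int k;

    for (k = 1; k <= M; k++) x[k] = (k - 1) * h;
    return h;
}
```

With M = 41 the spacing is h = 0.025, the first point is x = 0, and the last is x = 1 to within rounding.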


At interior points k = 2,3,...,M, equation (17.4.15) gives 

$$ E_{1,k} = y_{1,k} - y_{1,k-1} - \frac{h}{2}\left(y_{2,k} + y_{2,k-1}\right) \tag{17.4.23} $$

Equation (17.4.16) gives 

$$ E_{2,k} = y_{2,k} - y_{2,k-1} - \beta_k \left[\frac{(x_k + x_{k-1})(m + 1)(y_{2,k} + y_{2,k-1})}{2} - \alpha_k \frac{(y_{1,k} + y_{1,k-1})}{2}\right] \tag{17.4.24} $$

where 

$$ \alpha_k = \frac{y_{3,k} + y_{3,k-1}}{2} - \frac{c^2 (x_k + x_{k-1})^2}{4} \tag{17.4.25} $$
$$ \beta_k = \frac{h}{1 - \frac{1}{4}(x_k + x_{k-1})^2} \tag{17.4.26} $$

Finally, equation (17.4.17) gives 

$$ E_{3,k} = y_{3,k} - y_{3,k-1} \tag{17.4.27} $$
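
The three interior residuals can be evaluated with a short function, which is a convenient way to check a trial solution by hand. This is a sketch under stated assumptions: the name fde_residuals and its argument list are hypothetical and are not the difeq interface.

```c
/* Hypothetical helper: residuals E[1..3] of the FDEs (17.4.23), (17.4.24)
   and (17.4.27) coupling mesh points k-1 and k. ykm1[1..3] and yk[1..3]
   hold (y1, y2, y3) at the two points; index 0 of each array is unused. */
void fde_residuals(double xkm1, double xk, double h, int m, double c2,
                   const double *ykm1, const double *yk, double *E)
{
    double xs = xk + xkm1;                          /* x_k + x_{k-1} */
    double alpha = 0.5 * (yk[3] + ykm1[3])
                 - 0.25 * c2 * xs * xs;             /* equation (17.4.25) */
    double beta = h / (1.0 - 0.25 * xs * xs);       /* equation (17.4.26) */

    E[1] = yk[1] - ykm1[1] - 0.5 * h * (yk[2] + ykm1[2]);
    E[2] = yk[2] - ykm1[2]
         - beta * (0.5 * xs * (m + 1) * (yk[2] + ykm1[2])
                   - 0.5 * alpha * (yk[1] + ykm1[1]));
    E[3] = yk[3] - ykm1[3];
}
```

As a check, take m = 0, n = 1 and c^2 = 0, for which y = P_1(x) = x, y' = 1 and μ = 2. Because y is linear, the centered differences are exact and all three residuals vanish for any pair of mesh points.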

Now recall that the matrix of partial derivatives S_{i,j} of equation (17.3.8) is 
defined so that i labels the equation and j the variable. In our case, j runs from 1 to 
3 for y_j at k - 1 and from 4 to 6 for y_j at k. Thus equation (17.4.23) gives 

$$ \begin{aligned} S_{1,1} &= -1, & S_{1,2} &= -\frac{h}{2}, & S_{1,3} &= 0 \\ S_{1,4} &= 1, & S_{1,5} &= -\frac{h}{2}, & S_{1,6} &= 0 \end{aligned} \tag{17.4.28} $$

Similarly equation (17.4.24) yields 

$$ \begin{aligned} S_{2,1} &= \alpha_k \beta_k / 2, & S_{2,2} &= -1 - \beta_k (x_k + x_{k-1})(m + 1)/2, \\ S_{2,3} &= \beta_k (y_{1,k} + y_{1,k-1})/4, & S_{2,4} &= S_{2,1}, \\ S_{2,5} &= 2 + S_{2,2}, & S_{2,6} &= S_{2,3} \end{aligned} \tag{17.4.29} $$

while from equation (17.4.27) we find 

$$ \begin{aligned} S_{3,1} &= 0, & S_{3,2} &= 0, & S_{3,3} &= -1 \\ S_{3,4} &= 0, & S_{3,5} &= 0, & S_{3,6} &= 1 \end{aligned} \tag{17.4.30} $$


At x = 0 we have the boundary condition 

$$ E_{3,1} = \begin{cases} y_{1,1}, & n - m \text{ odd} \\ y_{2,1}, & n - m \text{ even} \end{cases} \tag{17.4.31} $$

Recall the convention adopted in the solvde routine that for one boundary condition 
at k = 1 only S_{3,j} can be nonzero. Also, j takes on the values 4 to 6 since the 
boundary condition involves only y_k, not y_{k-1}. Accordingly, the only nonzero 
values of S_{3,j} at x = 0 are 

$$ \begin{aligned} S_{3,4} &= 1, & n - m \text{ odd} \\ S_{3,5} &= 1, & n - m \text{ even} \end{aligned} \tag{17.4.32} $$

At x = 1 we have 

$$ E_{1,M+1} = y_{2,M} - \frac{y_{3,M} - c^2}{2(m + 1)}\, y_{1,M} \tag{17.4.33} $$
$$ E_{2,M+1} = y_{1,M} - \gamma \tag{17.4.34} $$

Thus 

$$ S_{1,4} = -\frac{y_{3,M} - c^2}{2(m + 1)}, \quad S_{1,5} = 1, \quad S_{1,6} = -\frac{y_{1,M}}{2(m + 1)} \tag{17.4.35} $$
$$ S_{2,4} = 1, \quad S_{2,5} = 0, \quad S_{2,6} = 0 \tag{17.4.36} $$

Here now is the sample program that implements the above algorithm. We 
need a main program, sfroid, that calls the routine solvde, and we must supply 
the function difeq called by solvde. For simplicity we choose an equally spaced 
mesh of m = 41 points, that is, h = .025. As we shall see, this gives good accuracy 
for the eigenvalues up to moderate values of n - m. 

Since the boundary condition at x = 0 does not involve y_1 if n - m is even, 
we have to use the indexv feature of solvde. Recall that the value of indexv[j] 
describes which column of s[i][j] the variable y[j] has been put in. If n - m 
is even, we need to interchange the columns for y_1 and y_2 so that there is not a 
zero pivot element in s[i][j]. 
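
The column choice just described amounts to a parity test on n - m. The helper below is a sketch with an assumed name, choose_indexv, for the three-variable problem of this section; it assumes n ≥ m ≥ 0.

```c
/* Hypothetical helper: fill indexv[1..3] (index 0 unused) according to
   the parity of n - m, as described in the text. */
void choose_indexv(int n, int m, int *indexv)
{
    if ((n - m) & 1) {          /* n - m odd: the condition at x = 0 involves y1 */
        indexv[1] = 1;
        indexv[2] = 2;
    } else {                    /* n - m even: it involves y2, so swap columns */
        indexv[1] = 2;
        indexv[2] = 1;
    }
    indexv[3] = 3;              /* the eigenvalue variable y3 stays in place */
}
```

Since n - m and n + m have the same parity, testing either one gives the same ordering.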

The program prompts for values of m and n. It then computes an initial guess 
for y based on the Legendre function P_n^m. It next prompts for c^2, solves for y, 
prompts for c^2, solves for y using the previous values as an initial guess, and so on. 

#include <stdio.h> 
#include <math.h> 
#include "nrutil.h" 
#define NE 3 
#define M 41 
#define NB 1 
#define NSI NE 
#define NSJ (2*NE+1) 
#define NYJ NE 
#define NYK M 
#define NCI NE 
#define NCJ (NE-NB+1) 
#define NCK (M+1) 

int mm,n,mpt=M; 
float h,c2=0.0,anorm,x[M+1];        /* Global variables communicating with difeq. */ 

int main(void)  /* Program sfroid */ 
/* Sample program using solvde. Computes eigenvalues of spheroidal harmonics S_mn(x;c) for 
m ≥ 0 and n ≥ m. In the program, m is mm, c^2 is c2, and γ of equation (17.4.20) is anorm. */ 
{ 
    float plgndr(int l, int m, float x); 
    void solvde(int itmax, float conv, float slowc, float scalv[], 
        int indexv[], int ne, int nb, int m, float **y, float ***c, float **s); 
    int i,itmax,k,indexv[NE+1]; 
    float conv,deriv,fac1,fac2,q1,slowc,scalv[NE+1]; 
    float **y,**s,***c; 

    y=matrix(1,NYJ,1,NYK); 
    s=matrix(1,NSI,1,NSJ); 
    c=f3tensor(1,NCI,1,NCJ,1,NCK); 
    itmax=100; 
    conv=5.0e-6; 
    slowc=1.0; 
    h=1.0/(M-1); 
    printf("\nenter m n\n"); 
    scanf("%d %d",&mm,&n); 
    if (n+mm & 1) {                 /* No interchanges necessary. */ 
        indexv[1]=1; 
        indexv[2]=2; 
        indexv[3]=3; 
    } else {                        /* Interchange y_1 and y_2. */ 
        indexv[1]=2; 
        indexv[2]=1; 
        indexv[3]=3; 
    } 
    anorm=1.0;                      /* Compute γ. */ 
    if (mm) { 
        q1=n; 
        for (i=1;i<=mm;i++) anorm = -0.5*anorm*(n+i)*(q1--/i); 
    } 
    for (k=1;k<=(M-1);k++) {        /* Initial guess. */ 
        x[k]=(k-1)*h; 
        fac1=1.0-x[k]*x[k]; 
        fac2=exp((-mm/2.0)*log(fac1)); 
        y[1][k]=plgndr(n,mm,x[k])*fac2;         /* P_n^m from §6.8. */ 
        deriv = -((n-mm+1)*plgndr(n+1,mm,x[k])- /* Derivative of P_n^m from a recurrence relation. */ 
            (n+1)*x[k]*plgndr(n,mm,x[k]))/fac1; 
        y[2][k]=mm*x[k]*y[1][k]/fac1+deriv*fac2; 
        y[3][k]=n*(n+1)-mm*(mm+1); 
    } 
    x[M]=1.0;                       /* Initial guess at x = 1 done separately. */ 
    y[1][M]=anorm; 
    y[3][M]=n*(n+1)-mm*(mm+1); 
    y[2][M]=(y[3][M]-c2)*y[1][M]/(2.0*(mm+1.0)); 
    scalv[1]=fabs(anorm); 
    scalv[2]=(y[2][M] > scalv[1] ? y[2][M] : scalv[1]); 
    scalv[3]=(y[3][M] > 1.0 ? y[3][M] : 1.0); 
    for (;;) { 
        printf("\nEnter c**2 or 999 to end.\n"); 
        scanf("%f",&c2); 
        if (c2 == 999) { 
            free_f3tensor(c,1,NCI,1,NCJ,1,NCK); 
            free_matrix(s,1,NSI,1,NSJ); 
            free_matrix(y,1,NYJ,1,NYK); 
            return 0; 
        } 
        solvde(itmax,conv,slowc,scalv,indexv,NE,NB,M,y,c,s); 
        printf("\n %s %2d %s %2d %s %7.3f %s %10.6f\n", 
            "m =",mm,"  n =",n,"  c**2 =",c2, 
            " lamda =",y[3][1]+mm*(mm+1)); 
    }                               /* Return for another value of c^2. */ 
} 

extern int mm,n,mpt;                /* Defined in sfroid. */ 
extern float h,c2,anorm,x[]; 

void difeq(int k, int k1, int k2, int jsf, int is1, int isf, int indexv[], 
    int ne, float **s, float **y) 
/* Returns matrix s for solvde. */ 
{ 
    float temp,temp1,temp2; 

    if (k == k1) {                          /* Boundary condition at first point. */ 
        if (n+mm & 1) { 
            s[3][3+indexv[1]]=1.0;          /* Equation (17.4.32). */ 
            s[3][3+indexv[2]]=0.0; 
            s[3][3+indexv[3]]=0.0; 
            s[3][jsf]=y[1][1];              /* Equation (17.4.31). */ 
        } else { 
            s[3][3+indexv[1]]=0.0;          /* Equation (17.4.32). */ 
            s[3][3+indexv[2]]=1.0; 
            s[3][3+indexv[3]]=0.0; 
            s[3][jsf]=y[2][1];              /* Equation (17.4.31). */ 
        } 
    } else if (k > k2) {                    /* Boundary conditions at last point. */ 
        s[1][3+indexv[1]] = -(y[3][mpt]-c2)/(2.0*(mm+1.0));    /* (17.4.35). */ 
        s[1][3+indexv[2]]=1.0; 
        s[1][3+indexv[3]] = -y[1][mpt]/(2.0*(mm+1.0)); 
        s[1][jsf]=y[2][mpt]-(y[3][mpt]-c2)*y[1][mpt]/(2.0*(mm+1.0));    /* (17.4.33). */ 
        s[2][3+indexv[1]]=1.0;              /* Equation (17.4.36). */ 
        s[2][3+indexv[2]]=0.0; 
        s[2][3+indexv[3]]=0.0; 
        s[2][jsf]=y[1][mpt]-anorm;          /* Equation (17.4.34). */ 
    } else {                                /* Interior point. */ 
        s[1][indexv[1]] = -1.0;             /* Equation (17.4.28). */ 
        s[1][indexv[2]] = -0.5*h; 
        s[1][indexv[3]]=0.0; 
        s[1][3+indexv[1]]=1.0; 
        s[1][3+indexv[2]] = -0.5*h; 
        s[1][3+indexv[3]]=0.0; 
        temp1=x[k]+x[k-1]; 
        temp=h/(1.0-temp1*temp1*0.25); 
        temp2=0.5*(y[3][k]+y[3][k-1])-c2*0.25*temp1*temp1; 
        s[2][indexv[1]]=temp*temp2*0.5;     /* Equation (17.4.29). */ 
        s[2][indexv[2]] = -1.0-0.5*temp*(mm+1.0)*temp1; 
        s[2][indexv[3]]=0.25*temp*(y[1][k]+y[1][k-1]); 
        s[2][3+indexv[1]]=s[2][indexv[1]]; 
        s[2][3+indexv[2]]=2.0+s[2][indexv[2]]; 
        s[2][3+indexv[3]]=s[2][indexv[3]]; 
        s[3][indexv[1]]=0.0;                /* Equation (17.4.30). */ 
        s[3][indexv[2]]=0.0; 
        s[3][indexv[3]] = -1.0; 
        s[3][3+indexv[1]]=0.0; 
        s[3][3+indexv[2]]=0.0; 
        s[3][3+indexv[3]]=1.0; 
        s[1][jsf]=y[1][k]-y[1][k-1]-0.5*h*(y[2][k]+y[2][k-1]);     /* (17.4.23). */ 
        s[2][jsf]=y[2][k]-y[2][k-1]-temp*((x[k]+x[k-1])             /* (17.4.24). */ 
            *0.5*(mm+1.0)*(y[2][k]+y[2][k-1])-temp2 
            *0.5*(y[1][k]+y[1][k-1])); 
        s[3][jsf]=y[3][k]-y[3][k-1];        /* Equation (17.4.27). */ 
    } 
} 

You can run the program and check it against values of $\lambda_{mn}(c)$ given in the tables
at the back of Flammer’s book [1] or in Table 21.1 of Abramowitz and Stegun [2].
Typically it converges in about 3 iterations. The table below gives a few comparisons.


Selected Output of sfroid

 m     n     c^2      λ_exact     λ_sfroid
 2     2     0.1      6.01427     6.01427
             1.0      6.14095     6.14095
             4.0      6.54250     6.54253
 2     5     1.0      30.4361     30.4372
            16.0      36.9963     37.0135
 4    11    -1.0      131.560     131.554


Shooting 

To solve the same problem via shooting (§17.1), we supply a function derivs
that implements equations (17.4.15) through (17.4.17). We will integrate the equations over
the range $-1 \le x \le 0$. We provide the function load, which sets the eigenvalue
$y_3$ to its current best estimate, v[1]. It also sets the boundary values of $y_1$ and
$y_2$ using equations (17.4.20) and (17.4.19) (with a minus sign corresponding to
$x = -1$). Note that the boundary condition is actually applied a distance dx from
the boundary to avoid having to evaluate $y_2'$ right on the boundary. The function
score follows from equation (17.4.18).


#include <stdio.h>
#include "nrutil.h"
#define N2 1

int m,n;                           /* Communicates with load, score, and derivs. */
float c2,dx,gmma;

int nvar;                          /* Communicates with shoot. */
float x1,x2;

int main(void)  /* Program sphoot */
/* Sample program using shoot. Computes eigenvalues of spheroidal harmonics S_mn(x;c) for
   m >= 0 and n >= m. Note how the routine vecfunc for newt is provided by shoot (§17.1). */
{
    void newt(float x[], int n, int *check,
        void (*vecfunc)(int, float [], float []));
    void shoot(int n, float v[], float f[]);
    int check,i;
    float q1,*v;

    v=vector(1,N2);
    dx=1.0e-4;                     /* Avoid evaluating derivatives exactly at x = -1. */
    nvar=3;                        /* Number of equations. */
    for (;;) {
        printf("input m,n,c-squared\n");
        if (scanf("%d %d %f",&m,&n,&c2) == EOF) break;
        if (n < m || m < 0) continue;
        gmma=1.0;                  /* Compute gamma of equation (17.4.20). */
        q1=n;
        for (i=1;i<=m;i++) gmma *= -0.5*(n+i)*(q1--/i);
        v[1]=n*(n+1)-m*(m+1)+c2/2.0;    /* Initial guess for eigenvalue. */
        x1 = -1.0+dx;              /* Set range of integration. */
        x2=0.0;
        newt(v,N2,&check,shoot);   /* Find v that zeros function f in score. */
        if (check) {
            printf("shoot failed; bad initial guess\n");
        } else {
            printf("\tmu(m,n)\n");
            printf("%12.6f\n",v[1]);
        }
    }
    free_vector(v,1,N2);
    return 0;
}

void load(float x1, float v[], float y[])
/* Supplies starting values for integration at x = -1 + dx. */
{
    float y1 = (n-m & 1 ? -gmma : gmma);
    y[3]=v[1];
    y[2] = -(y[3]-c2)*y1/(2*(m+1));
    y[1]=y1+y[2]*dx;
}

void score(float xf, float y[], float f[])
/* Tests whether boundary condition at x = 0 is satisfied. */
{
    f[1]=(n-m & 1 ? y[1] : y[2]);
}

void derivs(float x, float y[], float dydx[])
/* Evaluates derivatives for odeint. */
{
    dydx[1]=y[2];
    dydx[2]=(2.0*x*(m+1.0)*y[2]-(y[3]-c2*x*x)*y[1])/(1.0-x*x);
    dydx[3]=0.0;
}


Shooting to a Fitting Point 

For variety we illustrate shootf from §17.2 by integrating over the whole range
$-1 + \text{dx} \le x \le 1 - \text{dx}$, with the fitting point chosen to be at $x = 0$. The routine
derivs is identical to the one for shoot. Now, however, there are two load routines.
The routine load1 for $x = -1$ is essentially identical to load above. At $x = 1$,
load2 sets the function value $y_1$ and the eigenvalue $y_3$ to their best current estimates,
v2[1] and v2[2], respectively. If you quite sensibly make your initial guess of
the eigenvalue the same in the two intervals, then v1[1] will stay equal to v2[2]
during the iteration. The function score simply checks whether all three function
values match at the fitting point.


#include <stdio.h>
#include <math.h>
#include "nrutil.h"
#define N1 2
#define N2 1
#define NTOT (N1+N2)
#define DXX 1.0e-4

int m,n;                           /* Communicates with load1, load2, score, and derivs. */
float c2,dx,gmma;

int nn2,nvar;                      /* Communicates with shootf. */
float x1,x2,xf;

int main(void)  /* Program sphfpt */
/* Sample program using shootf. Computes eigenvalues of spheroidal harmonics S_mn(x;c) for
   m >= 0 and n >= m. Note how the routine vecfunc for newt is provided by shootf (§17.2).
   The routine derivs is the same as for sphoot. */
{
    void newt(float x[], int n, int *check,
        void (*vecfunc)(int, float [], float []));
    void shootf(int n, float v[], float f[]);
    int check,i;
    float q1,*v1,*v2,*v;

    v=vector(1,NTOT);
    v1=v;
    v2 = &v[N2];
    nvar=NTOT;                     /* Number of equations. */
    nn2=N2;
    dx=DXX;                        /* Avoid evaluating derivatives exactly at x = +-1. */
    for (;;) {
        printf("input m,n,c-squared\n");
        if (scanf("%d %d %f",&m,&n,&c2) == EOF) break;
        if (n < m || m < 0) continue;
        gmma=1.0;                  /* Compute gamma of equation (17.4.20). */
        q1=n;
        for (i=1;i<=m;i++) gmma *= -0.5*(n+i)*(q1--/i);
        v1[1]=n*(n+1)-m*(m+1)+c2/2.0;    /* Initial guess for eigenvalue and function value. */
        v2[2]=v1[1];
        v2[1]=gmma*(1.0-(v2[2]-c2)*dx/(2*(m+1)));
        x1 = -1.0+dx;              /* Set range of integration. */
        x2=1.0-dx;
        xf=0.0;                    /* Fitting point. */
        newt(v,NTOT,&check,shootf);    /* Find v that zeros function f in score. */
        if (check) {
            printf("shootf failed; bad initial guess\n");
        } else {
            printf("\tmu(m,n)\n");
            printf("%12.6f\n",v[1]);
        }
    }
    free_vector(v,1,NTOT);
    return 0;
}

void load1(float x1, float v1[], float y[])
/* Supplies starting values for integration at x = -1 + dx. */
{
    float y1 = (n-m & 1 ? -gmma : gmma);
    y[3]=v1[1];
    y[2] = -(y[3]-c2)*y1/(2*(m+1));
    y[1]=y1+y[2]*dx;
}

void load2(float x2, float v2[], float y[])
/* Supplies starting values for integration at x = 1 - dx. */
{
    y[3]=v2[2];
    y[1]=v2[1];
    y[2]=(y[3]-c2)*y[1]/(2*(m+1));
}

void score(float xf, float y[], float f[])
/* Tests whether solutions match at fitting point x = 0. */
{
    int i;

    for (i=1;i<=3;i++) f[i]=y[i];
}


CITED REFERENCES AND FURTHER READING: 

Flammer, C. 1957, Spheroidal Wave Functions (Stanford, CA: Stanford University Press). [1]

Abramowitz, M., and Stegun, I.A. 1964, Handbook of Mathematical Functions, Applied Mathematics
Series, Volume 55 (Washington: National Bureau of Standards; reprinted 1968 by
Dover Publications, New York), §21. [2]

Morse, P.M., and Feshbach, H. 1953, Methods of Theoretical Physics, Part II (New York:
McGraw-Hill), pp. 1502ff. [3]


17.5 Automated Allocation of Mesh Points


In relaxation problems, you have to choose values for the independent variable at the 
mesh points. This is called allocating the grid or mesh. The usual procedure is to pick 
a plausible set of values and, if it works, to be content. If it doesn’t work, increasing the 
number of points usually cures the problem. 

If we know ahead of time where our solutions will be rapidly varying, we can put more
grid points there and fewer elsewhere. Alternatively, we can solve the problem first on a uniform
mesh and then examine the solution to see where we should add more points. We then repeat
the solution with the improved grid. The object of the exercise is to allocate points in such
a way as to represent the solution accurately.

It is also possible to automate the allocation of mesh points, so that it is done 
“dynamically” during the relaxation process. This powerful technique not only improves 
the accuracy of the relaxation method, but also (as we will see in the next section) allows 
internal singularities to be handled in quite a neat way. Here we learn how to accomplish 
the automatic allocation. 

We want to focus attention on the independent variable x, and consider two alternative
reparametrizations of it. The first we term q; this is just the coordinate corresponding to the
mesh points themselves, so that q = 1 at k = 1, q = 2 at k = 2, and so on. Between any two
mesh points we have $\Delta q = 1$. In the change of independent variable in the ODEs from x to q,

$$\frac{dy}{dx} = g \tag{17.5.1}$$

becomes

$$\frac{dy}{dq} = g\,\frac{dx}{dq} \tag{17.5.2}$$

In terms of q, equation (17.5.2) as an FDE might be written

$$y_k - y_{k-1} - \frac{1}{2}\left[\left(g\,\frac{dx}{dq}\right)_k + \left(g\,\frac{dx}{dq}\right)_{k-1}\right] = 0 \tag{17.5.3}$$

or some related version. Note that dx/dq should accompany g. The transformation between
x and q depends only on the Jacobian dx/dq. Its reciprocal dq/dx is proportional to the
density of mesh points.
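As a small illustration of equation (17.5.3), the sketch below evaluates the residual of the FDE on one mesh interval for a single dependent variable. The function name fde_residual_q and its argument list are hypothetical and are not part of solvde or difeq; the point is only that, because $\Delta q = 1$, no explicit step size appears and dx/dq always multiplies g.

```c
/* Residual of the FDE (17.5.3) on the interval between mesh points k-1 and k,
 * for one dependent variable. Because the mesh variable q advances by exactly 1
 * per interval, the trapezoidal average of g*(dx/dq) needs no step-size factor.
 * All names are illustrative assumptions, not routines from the book. */
double fde_residual_q(double y_km1, double y_k,
                      double g_km1, double g_k,
                      double dxdq_km1, double dxdq_k)
{
    return y_k - y_km1 - 0.5 * (g_k * dxdq_k + g_km1 * dxdq_km1);
}
```

For example, with dy/dx = 2x on a uniform mesh of spacing 0.5 (so dx/dq = 0.5), the exact values y = x^2 at x = 1 and x = 1.5 give a zero residual, since the trapezoidal average is exact for a linear g.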

Now, given the function y(x), or its approximation at the current stage of relaxation,
we are supposed to have some idea of how we want to specify the density of mesh points.
For example, we might want dq/dx to be larger where y is changing rapidly, or near to the
boundaries, or both. In fact, we can probably make up a formula for what we would like
dq/dx to be proportional to. The problem is that we do not know the proportionality constant.
That is, the formula that we might invent would not have the correct integral over the whole
range of x so as to make q vary from 1 to M, according to its definition. To solve this problem
we introduce a second reparametrization Q(q), where Q is a new independent variable. The
relation between Q and q is taken to be linear, so that a mesh spacing formula for dQ/dx
differs only in its unknown proportionality constant. A linear relation implies

$$\frac{d^2 Q}{dq^2} = 0 \tag{17.5.4}$$

or, expressed in the usual manner as coupled first-order equations,

$$\frac{dQ(x)}{dq} = \psi, \qquad \frac{d\psi}{dq} = 0 \tag{17.5.5}$$

where $\psi$ is a new intermediate variable. We add these two equations to the set of ODEs
being solved.

Completing the prescription, we add a third ODE that is just our desired mesh-density
function, namely

$$\phi(x) = \frac{dQ}{dx} = \frac{dQ}{dq}\,\frac{dq}{dx} \tag{17.5.6}$$



where $\phi(x)$ is chosen by us. Written in terms of the mesh variable q, this equation is

$$\frac{dx}{dq} = \frac{\psi}{\phi(x)} \tag{17.5.7}$$

Notice that $\phi(x)$ should be chosen to be positive definite, so that the density of mesh points
is everywhere positive. Otherwise (17.5.7) can have a zero in its denominator.

To use automated mesh spacing, you add the three ODEs (17.5.5) and (17.5.7) to your
set of equations, i.e., to the array y[j][k]. Now x becomes a dependent variable! Q and $\psi$
also become new dependent variables. Normally, evaluating $\phi$ requires little extra work since
it will be composed from pieces of the g’s that exist anyway. The automated procedure allows
one to investigate quickly how the numerical results might be affected by various strategies
for mesh spacing. (A special case occurs if the desired mesh spacing function Q can be found
analytically, i.e., dQ/dx is directly integrable. Then, you need to add only two equations,
those in (17.5.5), and two new variables x, $\psi$.)

As an example of a typical strategy for implementing this scheme, consider a system
with one dependent variable y(x). We could set

$$dQ = \frac{dx}{\Delta} + \frac{|d\ln y|}{\delta} \tag{17.5.8}$$

or

$$\phi(x) = \frac{dQ}{dx} = \frac{1}{\Delta} + \left|\frac{dy/dx}{y\,\delta}\right| \tag{17.5.9}$$

where $\Delta$ and $\delta$ are constants that we choose. The first term would give a uniform spacing
in x if it alone were present. The second term forces more grid points to be used where y is
changing rapidly. The constants act to make every logarithmic change in y of an amount $\delta$
about as “attractive” to a grid point as a change in x of amount $\Delta$. You adjust the constants
according to taste. Other strategies are possible, such as a logarithmic spacing in x, replacing
dx in the first term with d ln x.
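To make the effect of (17.5.9) concrete, here is a minimal standalone sketch. It is not the relaxation scheme itself (there, x, Q and $\psi$ are solved for together with y); it assumes instead that y and dy/dx are already known on a fine sample grid, integrates $\phi$ by the trapezoidal rule to get Q(x), and then places mesh points at equal increments of Q. The names mesh_density and equidistribute, and the 0-based arrays, are assumptions made for this illustration.

```c
#include <math.h>
#include <stddef.h>

/* Mesh-density function of equation (17.5.9):
 * phi = 1/Delta + |(dy/dx) / (y * delta)|. Assumes y is nonzero. */
double mesh_density(double y, double dydx, double Delta, double delta)
{
    return 1.0 / Delta + fabs(dydx / (y * delta));
}

/* Place m >= 2 mesh points on [xf[0], xf[nf-1]] so that Q, the integral of
 * phi, increases by equal amounts between neighbours. xf[] is a fine,
 * strictly increasing sample grid (nf >= 2), phi[] holds positive values of
 * phi at those samples, and cumq[] is scratch space of length nf.
 * Hypothetical helper, for illustration only. */
void equidistribute(const double xf[], const double phi[], double cumq[],
                    size_t nf, double xmesh[], size_t m)
{
    size_t i, k, j = 0;

    cumq[0] = 0.0;
    for (i = 1; i < nf; i++)            /* trapezoidal estimate of Q(x) */
        cumq[i] = cumq[i-1] + 0.5 * (phi[i] + phi[i-1]) * (xf[i] - xf[i-1]);

    for (k = 0; k < m; k++) {
        /* Q is linear in q: the k-th point gets the fraction k/(m-1) of the total. */
        double target = cumq[nf-1] * (double)k / (double)(m - 1);
        while (j + 2 < nf && cumq[j+1] < target) j++;
        /* Linear inverse interpolation of Q within [xf[j], xf[j+1]]. */
        xmesh[k] = xf[j] + (target - cumq[j]) * (xf[j+1] - xf[j])
                          / (cumq[j+1] - cumq[j]);
    }
}
```

With a constant $\phi$ the routine reproduces a uniform mesh; where $\phi$ is larger, neighbouring mesh points move closer together, which is the behavior the two terms of (17.5.9) are meant to trade off through $\Delta$ and $\delta$.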


CITED REFERENCES AND FURTHER READING: 

Eggleton, P.P. 1971, Monthly Notices of the Royal Astronomical Society, vol. 151, pp. 351-364.
Kippenhan, R., Weigert, A., and Hofmeister, E. 1968, in Methods in Computational Physics, 
vol. 7 (New York: Academic Press), pp. 129ff. 


17.6 Handling Internal Boundary Conditions or Singular Points


Singularities can occur in the interiors of two point boundary value problems. Typically,
there is a point $x_s$ at which a derivative must be evaluated by an expression of the form

$$S(x_s) = \frac{N(x_s, y)}{D(x_s, y)} \tag{17.6.1}$$

where the denominator $D(x_s, y) = 0$. In physical problems with finite answers, singular
points usually come with their own cure: Where $D \to 0$, there the physical solution y must
be such as to make $N \to 0$ simultaneously, in such a way that the ratio takes on a meaningful
value. This constraint on the solution y is often called a regularity condition. The condition
that $D(x_s, y)$ satisfy some special constraint at $x_s$ is entirely analogous to an extra boundary
condition, an algebraic relation among the dependent variables that must hold at a point.

We discussed a related situation earlier, in §17.2, when we described the “fitting point
method” to handle the task of integrating equations with singular behavior at the boundaries.
In those problems you are unable to integrate from one side of the domain to the other.
However, the ODEs do have well-behaved derivatives and solutions in the neighborhood of
the singularity, so it is readily possible to integrate away from the point. Both the relaxation
method and the method of “shooting” to a fitting point handle such problems easily. Also,
in those problems the presence of singular behavior served to isolate some special boundary
values that had to be satisfied to solve the equations.

Figure 17.6.1. FDE matrix structure with an internal boundary condition. The internal condition
introduces a special block. (a) Original form, compare with Figure 17.3.1; (b) final form, compare
with Figure 17.3.2.

The difference here is that we are concerned with singularities arising at intermediate 
points, where the location of the singular point depends on the solution, so is not known a 
priori. Consequently, we face a circular task: The singularity prevents us from finding a 
numerical solution, but we need a numerical solution to find its location. Such singularities 
are also associated with selecting a special value for some variable which allows the solution 
to satisfy the regularity condition at the singular point. Thus, internal singularities take on 
aspects of being internal boundary conditions. 

One way of handling internal singularities is to treat the problem as a free boundary
problem, as discussed at the end of §17.0. Suppose, as a simple example, we consider
the equation

$$\frac{dy}{dx} = \frac{N(x,y)}{D(x,y)} \tag{17.6.2}$$

where N and D are required to pass through zero at some unknown point $x_s$. We add
the equation

$$z \equiv x_s - x_1, \qquad \frac{dz}{dx} = 0 \tag{17.6.3}$$

where $x_s$ is the unknown location of the singularity, and change the independent variable
to t by setting

$$x - x_1 = t z, \qquad 0 \le t \le 1 \tag{17.6.4}$$

The boundary conditions at t = 1 become

$$N(x,y) = 0, \qquad D(x,y) = 0 \tag{17.6.5}$$
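The change of variable (17.6.4) can be sketched as a derivative routine in t. Since dx/dt = z, equation (17.6.2) becomes dy/dt = z N/D, and (17.6.3) becomes dz/dt = 0. The code below is only an illustration under stated assumptions: the names derivs_t, xyfunc, toy_N and toy_D are hypothetical, and the toy numerator and denominator do not vanish anywhere, so they exercise the mapping but not the regularity condition (17.6.5), which a real problem must still impose at t = 1.

```c
/* Pointer to a function of (x, y), used for the numerator N and denominator D. */
typedef double (*xyfunc)(double x, double y);

/* Right-hand sides of the system (17.6.2)-(17.6.3) rewritten in the variable t
 * of equation (17.6.4). v[0] = y and v[1] = z = xs - x1; dvdt receives dy/dt
 * and dz/dt. The caller must keep t away from points where D vanishes. */
void derivs_t(double t, const double v[], double dvdt[],
              double x1, xyfunc N, xyfunc D)
{
    double z = v[1];
    double x = x1 + t * z;                    /* equation (17.6.4) */
    dvdt[0] = z * N(x, v[0]) / D(x, v[0]);    /* dy/dt = (dx/dt) * dy/dx */
    dvdt[1] = 0.0;                            /* z is constant, eq. (17.6.3) */
}

/* Toy numerator and denominator, used only to exercise the mapping. */
double toy_N(double x, double y) { (void)y; return x; }
double toy_D(double x, double y) { (void)x; (void)y; return 2.0; }
```

The unknown interval length z is carried along as an extra dependent variable, which is what lets a fixed range of t stand in for the unknown range of x.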

Use of an adaptive mesh as discussed in the previous section is another way to overcome
the difficulties of an internal singularity. For the problem (17.6.2), we add the mesh spacing
equations

$$\frac{dQ}{dq} = \psi \tag{17.6.6}$$

$$\frac{d\psi}{dq} = 0 \tag{17.6.7}$$

with a simple mesh spacing function that maps x uniformly into q, where q runs from 1 to
M, the number of mesh points:

$$Q(x) = x - x_1, \qquad \frac{dQ}{dx} = 1 \tag{17.6.8}$$


Having added three first-order differential equations, we must also add their corresponding
boundary conditions. If there were no singularity, these could simply be

$$\text{at } q = 1: \quad x = x_1, \quad Q = 0 \tag{17.6.9}$$

$$\text{at } q = M: \quad x = x_2 \tag{17.6.10}$$

and a total of N values $y_i$ specified at q = 1. In this case the problem is essentially an
initial value problem with all boundary conditions specified at $x_1$ and the mesh spacing
function is superfluous.

However, in the actual case at hand we impose the conditions

$$\text{at } q = 1: \quad x = x_1, \quad Q = 0 \tag{17.6.11}$$

$$\text{at } q = M: \quad N(x,y) = 0, \quad D(x,y) = 0 \tag{17.6.12}$$

and N − 1 values $y_i$ at q = 1. The “missing” $y_i$ is to be adjusted, in other words, so as
to make the solution go through the singular point in a regular (zero-over-zero) rather than
irregular (finite-over-zero) manner. Notice also that these boundary conditions do not directly
impose a value for $x_2$, which becomes an adjustable parameter that the code varies in an
attempt to match the regularity condition.

In this example the singularity occurred at a boundary, and the complication arose 
because the location of the boundary was unknown. In other problems we might wish to 
continue the integration beyond the internal singularity. For the example given above, we 
could simply integrate the ODEs to the singular point, then as a separate problem recommence 
the integration from the singular point on as far as we care to go. However, in other cases the
singularity occurs internally, but does not completely determine the problem: There are still 
some more boundary conditions to be satisfied further along in the mesh. Such cases present 
no difficulty in principle, but do require some adaptation of the relaxation code given in §17.3. 
In effect all you need to do is to add a “special” block of equations at the mesh point where 
the internal boundary conditions occur, and do the proper bookkeeping. 

Figure 17.6.1 illustrates a concrete example where the overall problem contains 5 
equations with 2 boundary conditions at the first point, one “internal” boundary condition, and 
two final boundary conditions. The figure shows the structure of the overall matrix equations 
along the diagonal in the vicinity of the special block. In the middle of the domain, blocks 
typically involve 5 equations (rows) in 10 unknowns (columns). For each block prior to the 
special block, the initial boundary conditions provided enough information to zero the first 
two columns of the blocks. The five FDEs eliminate five more columns, and the final three 
columns need to be stored for the backsubstitution step (as described in §17.3). To handle the
extra condition we break the normal cycle and add a special block with only one equation:
the internal boundary condition. This effectively reduces the required storage of unreduced
coefficients by one column for the rest of the grid, and allows us to reduce to zero the first
three columns of subsequent blocks. The functions red, pinvs, bksub can readily handle 
these cases with minor recoding, but each problem makes for a special case, and you will 
have to make the modifications as required. 


CITED REFERENCES AND FURTHER READING: 

London, R.A., and Flannery, B.P. 1982, Astrophysical Journal, vol. 258, pp. 260-269.


Chapter 18. Integral Equations and Inverse Theory

18.0 Introduction 


Many people, otherwise numerically knowledgeable, imagine that the numerical
solution of integral equations must be an extremely arcane topic, since, until recently,
it was almost never treated in numerical analysis textbooks. Actually there is a
large and growing literature on the numerical solution of integral equations; several
monographs have by now appeared [1-3]. One reason for the sheer volume of this
activity is that there are many different kinds of equations, each with many different
possible pitfalls; often many different algorithms have been proposed to deal with
a single case.

There is a close correspondence between linear integral equations, which specify 
linear, integral relations among functions in an infinite-dimensional function space, 
and plain old linear equations, which specify analogous relations among vectors 
in a finite-dimensional vector space. Because this correspondence lies at the heart 
of most computational algorithms, it is worth making it explicit as we recall how 
integral equations are classified. 

Fredholm equations involve definite integrals with fixed upper and lower limits.
An inhomogeneous Fredholm equation of the first kind has the form

$$g(t) = \int_a^b K(t,s)\,f(s)\,ds \tag{18.0.1}$$

Here f(t) is the unknown function to be solved for, while g(t) is a known “right-hand
side.” (In integral equations, for some odd reason, the familiar “right-hand side” is
conventionally written on the left!) The function of two variables, K(t,s), is called
the kernel. Equation (18.0.1) is analogous to the matrix equation

$$\mathbf{K}\cdot\mathbf{f} = \mathbf{g} \tag{18.0.2}$$

whose solution is $\mathbf{f} = \mathbf{K}^{-1}\cdot\mathbf{g}$, where $\mathbf{K}^{-1}$ is the matrix inverse. Like equation (18.0.2),
equation (18.0.1) has a unique solution whenever g is nonzero (the homogeneous
case with g = 0 is almost never useful) and K is invertible. However, as we shall
see, this latter condition is as often the exception as the rule.

The analog of the finite-dimensional eigenvalue problem

$$(\mathbf{K} - \sigma\mathbf{1})\cdot\mathbf{f} = \mathbf{g} \tag{18.0.3}$$

is called a Fredholm equation of the second kind, usually written

$$f(t) = \lambda\int_a^b K(t,s)\,f(s)\,ds + g(t) \tag{18.0.4}$$

Again, the notational conventions do not exactly correspond: $\lambda$ in equation (18.0.4)
is $1/\sigma$ in (18.0.3), while $\mathbf{g}$ is $-g/\lambda$. If $g$ (or $\mathbf{g}$) is zero, then the equation is said
to be homogeneous. If the kernel K(t,s) is bounded, then, like equation (18.0.3),
equation (18.0.4) has the property that its homogeneous form has solutions for
at most a denumerably infinite set $\lambda = \lambda_n$, n = 1, 2, ..., the eigenvalues. The
corresponding solutions $f_n(t)$ are the eigenfunctions. The eigenvalues are real if
the kernel is symmetric.

In the inhomogeneous case of nonzero $g$ (or $\mathbf{g}$), equations (18.0.3) and (18.0.4)
are soluble except when $\lambda$ (or $\sigma$) is an eigenvalue, because the integral operator
(or matrix) is singular then. In integral equations this dichotomy is called the
Fredholm alternative.

Fredholm equations of the first kind are often extremely ill-conditioned. Applying
the kernel to a function is generally a smoothing operation, so the solution,
which requires inverting the operator, will be extremely sensitive to small changes 
or errors in the input. Smoothing often actually loses information, and there is no 
way to get it back in an inverse operation. Specialized methods have been developed 
for such equations, which are often called inverse problems. In general, a method 
must augment the information given with some prior knowledge of the nature of the 
solution. This prior knowledge is then used, in one way or another, to restore lost 
information. We will introduce such techniques in §18.4. 

Inhomogeneous Fredholm equations of the second kind are much less often
ill-conditioned. Equation (18.0.4) can be rewritten as

$$\int_a^b \left[K(t,s) - \sigma\,\delta(t-s)\right] f(s)\,ds = -\sigma\,g(t) \tag{18.0.5}$$

where $\delta(t-s)$ is a Dirac delta function (and where we have changed from $\lambda$ to its
reciprocal $\sigma$ for clarity). If $\sigma$ is large enough in magnitude, then equation (18.0.5)
is, in effect, diagonally dominant and thus well-conditioned. Only if $\sigma$ is small do
we go back to the ill-conditioned case.

Homogeneous Fredholm equations of the second kind are likewise not particularly
ill-posed. If K is a smoothing operator, then it will map many f’s to zero,
or near-zero; there will thus be a large number of degenerate or nearly degenerate
eigenvalues around $\sigma = 0$ ($\lambda \to \infty$), but this will cause no particular computational
difficulties. In fact, we can now see that the magnitude of $\sigma$ needed to rescue the
inhomogeneous equation (18.0.5) from an ill-conditioned fate is generally much less
than that required for diagonal dominance. Since the $\sigma$ term shifts all eigenvalues,
it is enough that it be large enough to shift a smoothing operator’s forest of near-zero
eigenvalues away from zero, so that the resulting operator becomes invertible
(except, of course, at the discrete eigenvalues).

Volterra equations are a special case of Fredholm equations with K(t,s) = 0
for s > t. Chopping off the unnecessary part of the integration, Volterra equations are
written in a form where the upper limit of integration is the independent variable t.
The Volterra equation of the first kind

$$g(t) = \int_a^t K(t,s)\,f(s)\,ds \tag{18.0.6}$$

has as its analog the matrix equation (now written out in components)

$$\sum_{j=1}^{k} K_{kj}\,f_j = g_k \tag{18.0.7}$$

Comparing with equation (18.0.2), we see that the Volterra equation corresponds to 
a matrix K that is lower (i.e., left) triangular, with zero entries above the diagonal. 
As we know from Chapter 2, such matrix equations are trivially soluble by forward 
substitution. Techniques for solving Volterra equations are similarly straightforward. 
When experimental measurement noise does not dominate, Volterra equations of the 
first kind tend not to be ill-conditioned; the upper limit to the integral introduces a 
sharp step that conveniently spoils any smoothing properties of the kernel. 
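The forward substitution that solves the lower triangular system (18.0.7) can be sketched in a few lines. This is a minimal illustration, assuming a dense row-major matrix with 0-based indexing (the book's own routines use 1-based arrays); the name forward_substitute is hypothetical.

```c
/* Solve sum_{j<=k} K[k][j] * f[j] = g[k] for f, k = 0..n-1, where K is lower
 * triangular and stored row-major in a flat array of n*n doubles. Entries
 * above the diagonal are never read. Returns 0 on success, or -1 if a zero
 * diagonal element makes the system singular. */
int forward_substitute(const double *K, const double *g, double *f, int n)
{
    int k, j;

    for (k = 0; k < n; k++) {
        double sum = g[k];
        double diag = K[k * n + k];
        if (diag == 0.0) return -1;
        for (j = 0; j < k; j++)         /* subtract the already-known terms */
            sum -= K[k * n + j] * f[j];
        f[k] = sum / diag;
    }
    return 0;
}
```

Each unknown is obtained from those before it, so the cost is of order n^2 operations rather than the n^3 of a general linear solve.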

The Volterra equation of the second kind is written

$$f(t) = \int_a^t K(t,s)\,f(s)\,ds + g(t) \tag{18.0.8}$$

whose matrix analog is the equation

$$(\mathbf{K} - \mathbf{1})\cdot\mathbf{f} = \mathbf{g} \tag{18.0.9}$$

with K lower triangular. The reason there is no $\lambda$ in these equations is that (i) in
the inhomogeneous case (nonzero g) it can be absorbed into K, while (ii) in the
homogeneous case (g = 0), it is a theorem that Volterra equations of the second kind
with bounded kernels have no eigenvalues with square-integrable eigenfunctions.

We have specialized our definitions to the case of linear integral equations.
The integrand in a nonlinear version of equation (18.0.1) or (18.0.6) would be
K(t, s, f(s)) instead of K(t,s)f(s); a nonlinear version of equation (18.0.4) or
(18.0.8) would have an integrand K(t, s, f(t), f(s)). Nonlinear Fredholm equations
are considerably more complicated than their linear counterparts. Fortunately, they
do not occur as frequently in practice and we shall by and large ignore them in this
chapter. By contrast, solving nonlinear Volterra equations usually involves only a
slight modification of the algorithm for linear equations, as we shall see.

Almost all methods for solving integral equations numerically make use of
quadrature rules, frequently Gaussian quadratures. This would be a good time
for you to go back and review §4.5, especially the advanced material towards the 
end of that section. 

In the sections that follow, we first discuss Fredholm equations of the second kind 
with smooth kernels (§18.1). Nontrivial quadrature rules come into the discussion, 
but we will be dealing with well-conditioned systems of equations. We then return 
to Volterra equations (§18.2), and find that simple and straightforward methods are 
generally satisfactory for these equations. 

In §18.3 we discuss how to proceed in the case of singular kernels, focusing
largely on Fredholm equations (both first and second kinds). Singularities require
special quadrature rules, but they are also sometimes blessings in disguise, since they
can spoil a kernel’s smoothing and make problems well-conditioned.

In §§18.4-18.7 we face up to the issues of inverse problems. §18.4 is an 
introduction to this large subject. 

We should note here that wavelet transforms, already discussed in §13.10, are 
applicable not only to data compression and signal processing, but can also be used 
to transform some classes of integral equations into sparse linear problems that allow 
fast solution. You may wish to review §13.10 as part of reading this chapter. 

Some subjects, such as integro-differential equations, we must simply declare
to be beyond our scope. For a review of methods for integro-differential equations,
see Brunner [4].

It should go without saying that this one short chapter can only barely touch on 
a few of the most basic methods involved in this complicated subject. 


CITED REFERENCES AND FURTHER READING: 

Delves, L.M., and Mohamed, J.L. 1985, Computational Methods for Integral Equations
(Cambridge, U.K.: Cambridge University Press). [1]

Linz, P. 1985, Analytical and Numerical Methods for Volterra Equations (Philadelphia: S.I.A.M.). [2]

Atkinson, K.E. 1976, A Survey of Numerical Methods for the Solution of Fredholm Integral
Equations of the Second Kind (Philadelphia: S.I.A.M.). [3]

Brunner, H. 1988, in Numerical Analysis 1987, Pitman Research Notes in Mathematics vol. 170,
D.F. Griffiths and G.A. Watson, eds. (Harlow, Essex, U.K.: Longman Scientific and
Technical), pp. 18-38. [4]

Smithies, F. 1958, Integral Equations (Cambridge, U.K.: Cambridge University Press). 

Kanwal, R.P. 1971, Linear Integral Equations (New York: Academic Press). 

Green, C.D. 1969, Integral Equation Methods (New York: Barnes & Noble). 


18.1 Fredholm Equations of the Second Kind 

We desire a numerical solution for f(t) in the equation

$$f(t) = \lambda\int_a^b K(t,s)\,f(s)\,ds + g(t) \tag{18.1.1}$$

The method we describe, a very basic one, is called the Nystrom method. It requires
the choice of some approximate quadrature rule:

$$\int_a^b y(s)\,ds = \sum_{j=1}^{N} w_j\,y(s_j) \tag{18.1.2}$$

Here the set $\{w_j\}$ are the weights of the quadrature rule, while the N points $\{s_j\}$
are the abscissas.
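As one concrete instance of a rule of the form (18.1.2), the sketch below fills in the abscissas and weights of the repeated Simpson's rule and applies them to tabulated integrand values. The names simpson_rule and apply_rule, and the 0-based arrays, are assumptions of this illustration; they are not routines from the book.

```c
/* Fill s[0..n-1] and w[0..n-1] with the abscissas and weights of the repeated
 * Simpson's rule on [a,b]. n must be odd and at least 3. Returns 0 on
 * success, -1 for an invalid n. */
int simpson_rule(int n, double a, double b, double s[], double w[])
{
    int j;
    double h;

    if (n < 3 || n % 2 == 0) return -1;
    h = (b - a) / (n - 1);
    for (j = 0; j < n; j++) {
        s[j] = a + j * h;
        if (j == 0 || j == n - 1) w[j] = h / 3.0;       /* end points */
        else w[j] = (j % 2 ? 4.0 : 2.0) * h / 3.0;      /* interior: 4,2,4,... */
    }
    return 0;
}

/* Evaluate the right-hand side of (18.1.2): sum over j of w[j] * yvals[j],
 * where yvals[j] is the integrand evaluated at abscissa s[j]. */
double apply_rule(int n, const double w[], const double yvals[])
{
    double sum = 0.0;
    int j;

    for (j = 0; j < n; j++) sum += w[j] * yvals[j];
    return sum;
}
```

Any other rule (Gauss-Legendre, for instance) fits the same pattern: only the arrays of abscissas and weights change, which is why the Nystrom method can be written independently of the particular quadrature.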

What quadrature rule should we use? It is certainly possible to solve integral
equations with low-order quadrature rules like the repeated trapezoidal or Simpson’s
rules. We will see, however, that the solution method involves $O(N^3)$ operations,
and so the most efficient methods tend to use high-order quadrature rules to keep
N as small as possible. For smooth, nonsingular problems, nothing beats Gaussian
quadrature (e.g., Gauss-Legendre quadrature, §4.5). (For non-smooth or singular
kernels, see §18.3.)

Delves and Mohamed [1] investigated methods more complicated than the
Nystrom method. For straightforward Fredholm equations of the second kind, they
concluded “... the clear winner of this contest has been the Nystrom routine ... with
the N-point Gauss-Legendre rule. This routine is extremely simple.... Such results
are enough to make a numerical analyst weep.”

If we apply the quadrature rule (18.1.2) to equation (18.1.1), we get 

1 v 

fit) = ^^WjKi^s^fisj) + g(t) (18.1.3) 

3 =1 

Evaluate equation (18.1.3) at the quadrature points: 

$$ f(t_i) = \lambda \sum_{j=1}^{N} w_j K(t_i, s_j) f(s_j) + g(t_i) \tag{18.1.4} $$

Let $f_i$ be the vector $f(t_i)$, $g_i$ the vector $g(t_i)$, $K_{ij}$ the matrix $K(t_i, s_j)$, and define

$$ \widetilde{K}_{ij} = K_{ij} w_j \tag{18.1.5} $$

Then in matrix notation equation (18.1.4) becomes 

$$ (\mathbf{1} - \lambda \widetilde{\mathbf{K}}) \cdot \mathbf{f} = \mathbf{g} \tag{18.1.6} $$

This is a set of N linear algebraic equations in N unknowns that can be solved
by standard triangular decomposition techniques (§2.3); that is where the $O(N^3)$
operations count comes in. The solution is usually well-conditioned, unless λ is
very close to an eigenvalue.

Having obtained the solution at the quadrature points $\{t_i\}$, how do you get the
solution at some other point t? You do not simply use polynomial interpolation.
This destroys all the accuracy you have worked so hard to achieve. Nystrom’s key
observation was that you should use equation (18.1.3) as an interpolatory formula,
maintaining the accuracy of the solution.

We here give two routines for use with linear Fredholm equations of the second
kind. The routine fred2 sets up equation (18.1.6) and then solves it by LU
decomposition with calls to the routines ludcmp and lubksb. The Gauss-Legendre
quadrature is implemented by first getting the weights and abscissas with a call to
gauleg. Routine fred2 requires that you provide an external function that returns
g(t) and another that returns $\lambda K_{ij}$. It then returns the solution f at the quadrature
points. It also returns the quadrature points and weights. These are used by the
second routine fredin to carry out the Nystrom interpolation of equation (18.1.3)
and return the value of f at any point in the interval [a, b].

#include "nrutil.h"

void fred2(int n, float a, float b, float t[], float f[], float w[],
	float (*g)(float), float (*ak)(float, float))
/* Solves a linear Fredholm equation of the second kind. On input, a and b are the limits of
integration, and n is the number of points to use in the Gaussian quadrature. g and ak are
user-supplied external functions that respectively return g(t) and lambda*K(t,s). The routine
returns arrays t[1..n] and f[1..n] containing the abscissas t_i of the Gaussian quadrature and
the solution f at these abscissas. Also returned is the array w[1..n] of Gaussian weights for
use with the Nystrom interpolation routine fredin. */
{
	void gauleg(float x1, float x2, float x[], float w[], int n);
	void lubksb(float **a, int n, int *indx, float b[]);
	void ludcmp(float **a, int n, int *indx, float *d);
	int i,j,*indx;
	float d,**omk;

	indx=ivector(1,n);
	omk=matrix(1,n,1,n);
	gauleg(a,b,t,w,n);	/* Replace gauleg with another routine if not using
				   Gauss-Legendre quadrature. */
	for (i=1;i<=n;i++) {	/* Form 1 - lambda*K. */
		for (j=1;j<=n;j++)
			omk[i][j]=(float)(i == j)-(*ak)(t[i],t[j])*w[j];
		f[i]=(*g)(t[i]);
	}
	ludcmp(omk,n,indx,&d);	/* Solve linear equations. */
	lubksb(omk,n,indx,f);
	free_matrix(omk,1,n,1,n);
	free_ivector(indx,1,n);
}

float fredin(float x, int n, float a, float b, float t[], float f[],
	float w[], float (*g)(float), float (*ak)(float, float))
/* Given arrays t[1..n] and w[1..n] containing the abscissas and weights of the Gaussian
quadrature, and given the solution array f[1..n] from fred2, this function returns the value of
f at x using the Nystrom interpolation formula. On input, a and b are the limits of integration,
and n is the number of points used in the Gaussian quadrature. g and ak are user-supplied
external functions that respectively return g(t) and lambda*K(t,s). */
{
	int i;
	float sum=0.0;

	for (i=1;i<=n;i++) sum += (*ak)(x,t[i])*w[i]*f[i];
	return (*g)(x)+sum;
}


One disadvantage of a method based on Gaussian quadrature is that there is no 
simple way to obtain an estimate of the error in the result. The best practical method 
is to increase N by 50%, say, and treat the difference between the two estimates as a 
conservative estimate of the error in the result obtained with the larger value of N. 
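The Nystrom procedure of equations (18.1.3)-(18.1.6) can be sketched in a self-contained form. The following is only an illustration under stated assumptions, not the fred2/fredin routines themselves: it hard-wires a three-point Gauss-Legendre rule, uses a small Gaussian elimination in place of ludcmp/lubksb, and the names gauss3, solve, nystrom_solve, nystrom_interp, demo_g, and demo_ak are hypothetical. The test problem (chosen here, not taken from the text) is g(t) = t and λK(t,s) = ts on [0, 1], whose exact solution is f(t) = 3t/2.

```c
#include <math.h>

#define NQ 3  /* number of quadrature points, hard-wired for this sketch */

/* Three-point Gauss-Legendre rule mapped from [-1,1] to [a,b]. */
static void gauss3(double a, double b, double t[NQ], double w[NQ])
{
    const double xr = sqrt(0.6);               /* nodes on [-1,1]: 0, +-sqrt(3/5) */
    const double xm = 0.5 * (b + a), xl = 0.5 * (b - a);
    t[0] = xm - xl * xr;  w[0] = xl * 5.0 / 9.0;
    t[1] = xm;            w[1] = xl * 8.0 / 9.0;
    t[2] = xm + xl * xr;  w[2] = xl * 5.0 / 9.0;
}

/* Solve m.x = rhs by Gaussian elimination with partial pivoting.
   The solution overwrites rhs. Returns 0 on success, -1 if singular. */
static int solve(double m[NQ][NQ], double rhs[NQ])
{
    for (int c = 0; c < NQ; c++) {
        int p = c;
        for (int r = c + 1; r < NQ; r++)
            if (fabs(m[r][c]) > fabs(m[p][c])) p = r;
        if (m[p][c] == 0.0) return -1;
        for (int j = 0; j < NQ; j++) {          /* swap rows c and p */
            double s = m[c][j]; m[c][j] = m[p][j]; m[p][j] = s;
        }
        { double s = rhs[c]; rhs[c] = rhs[p]; rhs[p] = s; }
        for (int r = c + 1; r < NQ; r++) {      /* eliminate below the pivot */
            double fac = m[r][c] / m[c][c];
            for (int j = c; j < NQ; j++) m[r][j] -= fac * m[c][j];
            rhs[r] -= fac * rhs[c];
        }
    }
    for (int r = NQ - 1; r >= 0; r--) {         /* back substitution */
        for (int j = r + 1; j < NQ; j++) rhs[r] -= m[r][j] * rhs[j];
        rhs[r] /= m[r][r];
    }
    return 0;
}

/* Assumed test problem: g(t) = t, lambda*K(t,s) = t*s. */
static double demo_g(double t) { return t; }
static double demo_ak(double t, double s) { return t * s; }

/* Set up and solve (1 - lambda*Ktilde).f = g, equation (18.1.6). */
int nystrom_solve(double a, double b, double t[NQ], double w[NQ], double f[NQ])
{
    double m[NQ][NQ];
    gauss3(a, b, t, w);
    for (int i = 0; i < NQ; i++) {
        for (int j = 0; j < NQ; j++)
            m[i][j] = (i == j) - demo_ak(t[i], t[j]) * w[j];
        f[i] = demo_g(t[i]);
    }
    return solve(m, f);
}

/* Nystrom interpolation, equation (18.1.3), at an arbitrary point x. */
double nystrom_interp(double x, const double t[NQ], const double w[NQ],
                      const double f[NQ])
{
    double sum = 0.0;
    for (int i = 0; i < NQ; i++) sum += demo_ak(x, t[i]) * w[i] * f[i];
    return demo_g(x) + sum;
}
```

Because the integrand here is a low-degree polynomial, the three-point rule integrates it exactly and the computed values agree with 3t/2 to rounding error; for a general kernel one would repeat the calculation with a larger N, as described above, to estimate the error.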

Turn now to solutions of the homogeneous equation. If we set $\lambda = 1/\sigma$ and
$\mathbf{g} = 0$, then equation (18.1.6) becomes a standard eigenvalue equation

$$ \widetilde{\mathbf{K}} \cdot \mathbf{f} = \sigma \mathbf{f} \tag{18.1.7} $$

which we can solve with any convenient matrix eigenvalue routine (see Chapter
11). Note that if our original problem had a symmetric kernel, then the matrix $\mathbf{K}$
is symmetric. However, since the weights $w_j$ are not equal for most quadrature
rules, the matrix $\widetilde{\mathbf{K}}$ (equation 18.1.5) is not symmetric. The matrix eigenvalue
problem is much easier for symmetric matrices, and so we should restore the
symmetry if possible. Provided the weights are positive (which they are for Gaussian
quadrature), we can define the diagonal matrix $\mathbf{D} = \mathrm{diag}(w_j)$ and its square root,
$\mathbf{D}^{1/2} = \mathrm{diag}(\sqrt{w_j})$. Then equation (18.1.7) becomes

$$ \mathbf{K} \cdot \mathbf{D} \cdot \mathbf{f} = \sigma \mathbf{f} $$

Multiplying by $\mathbf{D}^{1/2}$, we get

$$ (\mathbf{D}^{1/2} \cdot \mathbf{K} \cdot \mathbf{D}^{1/2}) \cdot \mathbf{h} = \sigma \mathbf{h} \tag{18.1.8} $$

where $\mathbf{h} = \mathbf{D}^{1/2} \cdot \mathbf{f}$. Equation (18.1.8) is now in the form of a symmetric eigenvalue
problem.

Solution of equations (18.1.7) or (18.1.8) will in general give N eigenvalues, 
where N is the number of quadrature points used. For square-integrable kernels, 
these will provide good approximations to the lowest N eigenvalues of the integral 
equation. Kernels of finite rank (also called degenerate or separable kernels) have 
only a finite number of nonzero eigenvalues (possibly none). You can diagnose 
this situation by a cluster of eigenvalues σ that are zero to machine precision. The
number of nonzero eigenvalues will stay constant as you increase N to improve
their accuracy. Some care is required here: A nondegenerate kernel can have an
infinite number of eigenvalues that have an accumulation point at σ = 0. You
distinguish the two cases by the behavior of the solution as you increase N. If you 
suspect a degenerate kernel, you will usually be able to solve the problem by analytic 
techniques described in all the textbooks. 


CITED REFERENCES AND FURTHER READING: 

Delves, L.M., and Mohamed, J.L. 1985, Computational Methods for Integral Equations
(Cambridge, U.K.: Cambridge University Press). [1]

Atkinson, K.E. 1976, A Survey of Numerical Methods for the Solution of Fredholm Integral 
Equations of the Second Kind (Philadelphia: S.I.A.M.). 


18.2 Volterra Equations 



Let us now turn to Volterra equations, of which our prototype is the Volterra 
equation of the second kind. 


$$ f(t) = \int_a^t K(t,s) f(s)\,ds + g(t) \tag{18.2.1} $$


Most algorithms for Volterra equations march out from t = a, building up the solution
as they go. In this sense they resemble not only forward substitution (as discussed
in §18.0), but also initial-value problems for ordinary differential equations. In fact,
many algorithms for ODEs have counterparts for Volterra equations.

The simplest way to proceed is to solve the equation on a mesh with uniform 
spacing: 


$$ t_i = a + ih, \qquad i = 0, 1, \ldots, N, \qquad h \equiv \frac{b-a}{N} \tag{18.2.2} $$

To do so, we must choose a quadrature rule. For a uniform mesh, the simplest 
scheme is the trapezoidal rule, equation (4.1.11): 


$$ \int_a^{t_i} K(t_i, s) f(s)\,ds = h \left( \tfrac{1}{2} K_{i0} f_0 + \sum_{j=1}^{i-1} K_{ij} f_j + \tfrac{1}{2} K_{ii} f_i \right) \tag{18.2.3} $$


Thus the trapezoidal method for equation (18.2.1) is: 

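$$ f_0 = g_0 $$

$$ \left( 1 - \tfrac{1}{2} h K_{ii} \right) f_i = h \left( \tfrac{1}{2} K_{i0} f_0 + \sum_{j=1}^{i-1} K_{ij} f_j \right) + g_i, \qquad i = 1, \ldots, N \tag{18.2.4} $$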


(For a Volterra equation of the first kind, the leading 1 on the left would be absent, 
and g would have opposite sign, with corresponding straightforward changes in the 
rest of the discussion.) 

Equation (18.2.4) is an explicit prescription that gives the solution in $O(N^2)$
operations. Unlike Fredholm equations, it is not necessary to solve a system of linear 
equations. Volterra equations thus usually involve less work than the corresponding 
Fredholm equations which, as we have seen, do involve the inversion of, sometimes 
large, linear systems. 
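For a single (scalar) equation, the marching scheme of equation (18.2.4) reduces to one division per step. The following minimal sketch assumes zero-based storage f[0..n] and a hypothetical test problem chosen for this illustration: g(t) = t and K(t,s) = s - t on a = 0, whose exact solution is f(t) = sin t. The names volterra_trap, vt_g, and vt_k are not from the text.

```c
#include <math.h>

/* Solve f(t) = int_a^t K(t,s) f(s) ds + g(t) at t_i = a + i*h, i = 0..n,
   by the trapezoidal scheme of equation (18.2.4).
   The array f must have room for n+1 values. */
void volterra_trap(int n, double a, double h,
                   double (*g)(double), double (*k)(double, double),
                   double f[])
{
    f[0] = g(a);                                  /* f_0 = g_0 */
    for (int i = 1; i <= n; i++) {
        double ti = a + i * h;
        double sum = 0.5 * k(ti, a) * f[0];       /* (1/2) K_i0 f_0 */
        for (int j = 1; j < i; j++)
            sum += k(ti, a + j * h) * f[j];       /* interior terms K_ij f_j */
        /* Move the (1/2) h K_ii f_i term to the left-hand side and divide. */
        f[i] = (h * sum + g(ti)) / (1.0 - 0.5 * h * k(ti, ti));
    }
}

/* Assumed test problem with exact solution f(t) = sin(t). */
double vt_g(double t) { return t; }
double vt_k(double t, double s) { return s - t; }
```

The double loop makes the $O(N^2)$ operation count explicit: step i costs i kernel evaluations, and no linear system is solved.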

The efficiency of solving Volterra equations is somewhat counterbalanced by 
the fact that systems of these equations occur more frequently in practice. If we 
interpret equation (18.2.1) as a vector equation for the vector of m functions f(t), 
then the kernel K(t, s) is an m x m matrix. Equation (18.2.4) must now also be 
understood as a vector equation. For each i, we have to solve the m x m set of 
linear algebraic equations by Gaussian elimination. 

The routine voltra below implements this algorithm. You must supply an
external function that returns the kth function of the vector g(t) at the point t, and
another that returns the (k, l) element of the matrix K(t, s) at (t, s). The routine
voltra then returns the vector f(t) at the regularly spaced points $t_i$.


#include "nrutil.h"

void voltra(int n, int m, float t0, float h, float *t, float **f,
	float (*g)(int, float), float (*ak)(int, int, float, float))
/* Solves a set of m linear Volterra equations of the second kind using the extended trapezoidal
rule. On input, t0 is the starting point of the integration and n-1 is the number of steps of
size h to be taken. g(k,t) is a user-supplied external function that returns g_k(t), while
ak(k,l,t,s) is another user-supplied external function that returns the (k,l) element of the
matrix K(t,s). The solution is returned in f[1..m][1..n], with the corresponding abscissas in
t[1..n]. */
{
	void lubksb(float **a, int n, int *indx, float b[]);
	void ludcmp(float **a, int n, int *indx, float *d);
	int i,j,k,l,*indx;
	float d,sum,**a,*b;

	indx=ivector(1,m);
	a=matrix(1,m,1,m);
	b=vector(1,m);
	t[1]=t0;
	for (k=1;k<=m;k++) f[k][1]=(*g)(k,t[1]);	/* Initialize. */
	for (i=2;i<=n;i++) {				/* Take a step h. */
		t[i]=t[i-1]+h;
		for (k=1;k<=m;k++) {
			sum=(*g)(k,t[i]);	/* Accumulate right-hand side of linear
						   equations in sum. */
			for (l=1;l<=m;l++) {
				sum += 0.5*h*(*ak)(k,l,t[i],t[1])*f[l][1];
				for (j=2;j<i;j++)
					sum += h*(*ak)(k,l,t[i],t[j])*f[l][j];
				a[k][l]=(k == l)-0.5*h*(*ak)(k,l,t[i],t[i]);	/* Left-hand side
										   goes in matrix a. */
			}
			b[k]=sum;
		}
		ludcmp(a,m,indx,&d);	/* Solve linear equations. */
		lubksb(a,m,indx,b);
		for (k=1;k<=m;k++) f[k][i]=b[k];
	}
	free_vector(b,1,m);
	free_matrix(a,1,m,1,m);
	free_ivector(indx,1,m);
}


For nonlinear Volterra equations, equation (18.2.4) holds with the product $K_{ii} f_i$
replaced by $K_{ii}(f_i)$, and similarly for the other two products of K’s and f’s. Thus
for each i we solve a nonlinear equation for $f_i$ with a known right-hand side.
Newton’s method (§9.4 or §9.6) with an initial guess of $f_{i-1}$ usually works very
well provided the stepsize is not too big.

Higher-order methods for solving Volterra equations are, in our opinion, not as 
important as for Fredholm equations, since Volterra equations are relatively easy to 
solve. However, there is an extensive literature on the subject. Several difficulties 
arise. First, any method that achieves higher order by operating on several quadrature 
points simultaneously will need a special method to get started, when values at the 
first few points are not yet known. 

Second, stable quadrature rules can give rise to unexpected instabilities in 
integral equations. For example, suppose we try to replace the trapezoidal rule in 
the algorithm above with Simpson’s rule. Simpson’s rule naturally integrates over 
an interval 2h, so we easily get the function values at the even mesh points. For the
odd mesh points, we could try appending one panel of trapezoidal rule. But to which 
end of the integration should we append it? We could do one step of trapezoidal rule 
followed by all Simpson’s rule, or Simpson’s rule with one step of trapezoidal rule 
at the end. Surprisingly, the former scheme is unstable, while the latter is fine! 

A simple approach that can be used with the trapezoidal method given above
is Richardson extrapolation: Compute the solution with stepsize h and h/2. Then,
assuming the error scales with $h^2$, compute

$$ f_E = \frac{4 f(h/2) - f(h)}{3} \tag{18.2.5} $$

This procedure can be repeated as with Romberg integration.

The general consensus is that the best of the higher order methods is the
block-by-block method (see [1]). Another important topic is the use of variable
stepsize methods, which are much more efficient if there are sharp features in K or
f. Variable stepsize methods are quite a bit more complicated than their counterparts
for differential equations; we refer you to the literature [1,2] for a discussion.

You should also be on the lookout for singularities in the integrand. If you find 
them, then look to §18.3 for additional ideas. 
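The Richardson extrapolation of equation (18.2.5) can be tried on a problem with a known answer. The sketch below is an illustration only; it assumes the scalar test equation f(t) = 1 + ∫_0^t f(s) ds (that is, K = 1 and g = 1, with exact solution e^t), which is chosen here and does not appear in the text. The names trap_end and richardson are hypothetical.

```c
#include <math.h>

/* Trapezoidal scheme (18.2.4) for f(t) = 1 + int_0^t f(s) ds.
   Returns the computed value of f at t = n*h. */
double trap_end(int n, double h)
{
    double f0 = 1.0;    /* f_0 = g_0 */
    double fi = f0;
    double acc = 0.0;   /* running sum of f_j for 1 <= j < i */
    for (int i = 1; i <= n; i++) {
        fi = (h * (0.5 * f0 + acc) + 1.0) / (1.0 - 0.5 * h);
        acc += fi;
    }
    return fi;
}

/* Equation (18.2.5): combine solutions at stepsizes h (coarse) and h/2 (fine),
   assuming the leading error term scales as h^2. */
double richardson(double coarse, double fine)
{
    return (4.0 * fine - coarse) / 3.0;
}
```

For example, combining trap_end(10, 0.1) and trap_end(20, 0.05) gives an estimate of f(1) = e that is considerably closer to the exact value than either trapezoidal result alone, as expected when the h^2 error terms cancel.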

CITED REFERENCES AND FURTHER READING: 

Linz, P. 1985, Analytical and Numerical Methods for Volterra Equations (Philadelphia: S.I.A.M.).
[1]

Delves, L.M., and Mohamed, J.L. 1985, Computational Methods for Integral Equations
(Cambridge, U.K.: Cambridge University Press). [2]


18.3 Integral Equations with Singular Kernels 


Many integral equations have singularities in either the kernel or the solution or both. 
A simple quadrature method will show poor convergence with N if such singularities are 
ignored. There is sometimes art in how singularities are best handled. 

We start with a few straightforward suggestions: 

1. Integrable singularities can often be removed by a change of variable. For example, the
singular behavior $K(t,s) \sim s^{1/2}$ or $s^{-1/2}$ near s = 0 can be removed by the transformation
$z = s^{1/2}$. Note that we are assuming that the singular behavior is confined to K, whereas
the quadrature actually involves the product K(t, s)f(s), and it is this product that must be
“fixed.” Ideally, you must deduce the singular nature of the product before you try a numerical
solution, and take the appropriate action. Commonly, however, a singular kernel does not
produce a singular solution f(t). (The highly singular kernel $K(t,s) = \delta(t - s)$ is simply
the identity operator, for example.)

2. If K(t,s) can be factored as $w(s)\overline{K}(t,s)$, where w(s) is singular and $\overline{K}(t,s)$ is
smooth, then a Gaussian quadrature based on w(s) as a weight function will work well. Even
if the factorization is only approximate, the convergence is often improved dramatically. All
you have to do is replace gauleg in the routine fred2 by another quadrature routine. Section
4.5 explained how to construct such quadratures; or you can find tabulated abscissas and
weights in the standard references [1,2]. You must of course supply $\overline{K}$ instead of K.

This method is a special case of the product Nystrom method [3,4], where one factors out
a singular term p(t, s) depending on both t and s from K and constructs suitable weights for
its Gaussian quadrature. The calculations in the general case are quite cumbersome, because
the weights depend on the chosen $\{t_i\}$ as well as the form of p(t, s).

We prefer to implement the product Nystrom method on a uniform grid, with a quadrature 
scheme that generalizes the extended Simpson’s 3/8 rule (equation 4.1.5) to arbitrary weight 
functions. We discuss this in the subsections below. 

3. Special quadrature formulas are also useful when the kernel is not strictly singular,
but is “almost” so. One example is when the kernel is concentrated near t = s on a scale much
smaller than the scale on which the solution f(t) varies. In that case, a quadrature formula
can be based on locally approximating f(s) by a polynomial or spline, while calculating the
first few moments of the kernel K(t,s) at the tabulation points $t_i$. In such a scheme the
narrow width of the kernel becomes an asset, rather than a liability: The quadrature becomes
exact as the width of the kernel goes to zero.

4. An infinite range of integration is also a form of singularity. Truncating the range at a
large finite value should be used only as a last resort. If the kernel goes rapidly to zero, then
a Gauss-Laguerre [$w \sim \exp(-\alpha s)$] or Gauss-Hermite [$w \sim \exp(-s^2)$] quadrature should
work well. Long-tailed functions often succumb to the transformation

$$ s = \frac{2\alpha}{z+1} - \alpha \tag{18.3.1} $$

which maps 0 < s < ∞ to 1 > z > -1 so that Gauss-Legendre integration can be used.
Here α > 0 is a constant that you adjust to improve the convergence.

5. A common situation in practice is that K(t, s) is singular along the diagonal line
t = s. Here the Nystrom method fails completely because the kernel gets evaluated at $(t_i, s_i)$.
Subtraction of the singularity is one possible cure:

$$ \begin{aligned} \int_a^b K(t,s) f(s)\,ds &= \int_a^b K(t,s)[f(s) - f(t)]\,ds + \int_a^b K(t,s) f(t)\,ds \\ &= \int_a^b K(t,s)[f(s) - f(t)]\,ds + r(t) f(t) \end{aligned} \tag{18.3.2} $$

where $r(t) = \int_a^b K(t,s)\,ds$ is computed analytically or numerically. If the first term on
the right-hand side is now regular, we can use the Nystrom method. Instead of equation
(18.1.4), we get

$$ f_i = \lambda \sum_{j=1,\, j \ne i}^{N} w_j K_{ij} [f_j - f_i] + \lambda r_i f_i + g_i \tag{18.3.3} $$

Sometimes the subtraction process must be repeated before the kernel is completely regularized.
See [3] for details. (And read on for a different, we think better, way to handle diagonal
singularities.)
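The change of variable in equation (18.3.1) (suggestion 4 above) can be sketched as follows. This is a minimal illustration under stated assumptions: it hard-wires a three-point Gauss-Legendre rule in z, and the function names and the long-tailed test integrand 1/(1+s)^2 (whose integral over 0 < s < ∞ is 1) are chosen for this example rather than taken from the text.

```c
#include <math.h>

/* Approximate int_0^infinity f(s) ds using s = 2*alpha/(z+1) - alpha,
   equation (18.3.1), and a three-point Gauss-Legendre rule on -1 < z < 1. */
double integrate_semi_infinite(double (*f)(double), double alpha)
{
    static const double z[3]  = { -0.7745966692414834, 0.0, 0.7745966692414834 };
    static const double wz[3] = { 5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0 };
    double sum = 0.0;
    for (int i = 0; i < 3; i++) {
        double zp1 = z[i] + 1.0;
        double s = 2.0 * alpha / zp1 - alpha;      /* mapped abscissa */
        double jac = 2.0 * alpha / (zp1 * zp1);    /* |ds/dz| */
        sum += wz[i] * f(s) * jac;
    }
    return sum;
}

/* Assumed long-tailed test integrand; its integral over (0, infinity) is 1. */
double long_tail(double s) { return 1.0 / ((1.0 + s) * (1.0 + s)); }
```

With α = 1 the transformed integrand for this particular test function is constant in z, so the rule is exact; in general α is tuned, as noted above, to improve convergence. The Gauss-Legendre abscissas are interior points, so the endpoint z = -1 (s → ∞) is never evaluated.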


Quadrature on a Uniform Mesh with Arbitrary Weight 


It is possible in general to find n-point linear quadrature rules that approximate the
integral of a function f(x), times an arbitrary weight function w(x), over an arbitrary range
of integration (a, b), as the sum of weights times n evenly spaced values of the function f(x),
say at x = kh, (k + 1)h, ..., (k + n - 1)h. The general scheme for deriving such quadrature
rules is to write down the n linear equations that must be satisfied if the quadrature rule is
to be exact for the n functions $f(x) = \mathrm{const}, x, x^2, \ldots, x^{n-1}$, and then solve these for the
coefficients. This can be done analytically, once and for all, if the moments of the weight
function over the same range of integration,

$$ W_n \equiv \frac{1}{h^n} \int_a^b x^n w(x)\,dx \tag{18.3.4} $$

are assumed to be known. Here the prefactor $h^{-n}$ is chosen to make $W_n$ scale as h if (as
in the usual case) b - a is proportional to h.

Carrying out this prescription for the four-point case gives the result

$$ \begin{aligned} \int_a^b w(x) f(x)\,dx = {} & \tfrac{1}{6} f(kh) \left[ (k+1)(k+2)(k+3) W_0 - (3k^2 + 12k + 11) W_1 + 3(k+2) W_2 - W_3 \right] \\ & + \tfrac{1}{2} f([k+1]h) \left[ -k(k+2)(k+3) W_0 + (3k^2 + 10k + 6) W_1 - (3k+5) W_2 + W_3 \right] \\ & + \tfrac{1}{2} f([k+2]h) \left[ k(k+1)(k+3) W_0 - (3k^2 + 8k + 3) W_1 + (3k+4) W_2 - W_3 \right] \\ & + \tfrac{1}{6} f([k+3]h) \left[ -k(k+1)(k+2) W_0 + (3k^2 + 6k + 2) W_1 - 3(k+1) W_2 + W_3 \right] \end{aligned} \tag{18.3.5} $$

While the terms in brackets superficially appear to scale as $k^2$, there is typically cancellation
at both $O(k^2)$ and $O(k)$.

Equation (18.3.5) can be specialized to various choices of (a, b). The obvious choice
is a = kh, b = (k + 3)h, in which case we get a four-point quadrature rule that generalizes
Simpson’s 3/8 rule (equation 4.1.5). In fact, we can recover this special case by setting
w(x) = 1, in which case (18.3.4) becomes

$$ W_n = \frac{h}{n+1} \left[ (k+3)^{n+1} - k^{n+1} \right] \tag{18.3.6} $$

The four terms in square brackets in equation (18.3.5) each become independent of k, and
(18.3.5) in fact reduces to

$$ \int_{kh}^{(k+3)h} f(x)\,dx = \frac{3h}{8} f(kh) + \frac{9h}{8} f([k+1]h) + \frac{9h}{8} f([k+2]h) + \frac{3h}{8} f([k+3]h) \tag{18.3.7} $$

Back to the case of general w(x), some other choices for a and b are also useful. For
example, we may want to choose (a, b) to be ([k + 1]h, [k + 3]h) or ([k + 2]h, [k + 3]h),
allowing us to finish off an extended rule whose number of intervals is not a multiple
of three, without loss of accuracy: The integral will be estimated using the four values
f(kh), ..., f([k + 3]h). Even more useful is to choose (a, b) to be ([k + 1]h, [k + 2]h), thus
using four points to integrate a centered single interval. These weights, when sewed together
into an extended formula, give quadrature schemes that have smooth coefficients, i.e., without
the Simpson-like 2, 4, 2, 4, 2 alternation. (In fact, this was the technique that we used to derive
equation 4.1.14, which you may now wish to reexamine.)

All these rules are of the same order as the extended Simpson’s rule, that is, exact
for f(x) a cubic polynomial. Rules of lower order, if desired, are similarly obtained. The
three point formula is

$$ \begin{aligned} \int_a^b w(x) f(x)\,dx = {} & \tfrac{1}{2} f(kh) \left[ (k+1)(k+2) W_0 - (2k+3) W_1 + W_2 \right] \\ & + f([k+1]h) \left[ -k(k+2) W_0 + 2(k+1) W_1 - W_2 \right] \\ & + \tfrac{1}{2} f([k+2]h) \left[ k(k+1) W_0 - (2k+1) W_1 + W_2 \right] \end{aligned} \tag{18.3.8} $$

Here the simple special case is to take w(x) = 1, so that

$$ W_n = \frac{h}{n+1} \left[ (k+2)^{n+1} - k^{n+1} \right] \tag{18.3.9} $$

Then equation (18.3.8) becomes Simpson’s rule,

$$ \int_{kh}^{(k+2)h} f(x)\,dx = \frac{h}{3} f(kh) + \frac{4h}{3} f([k+1]h) + \frac{h}{3} f([k+2]h) \tag{18.3.10} $$

For nonconstant weight functions w(x), however, equation (18.3.8) gives rules of one order 
less than Simpson, since they do not benefit from the extra symmetry of the constant case. 

The two point formula is simply 


$$ \int_{kh}^{(k+1)h} w(x) f(x)\,dx = f(kh) \left[ (k+1) W_0 - W_1 \right] + f([k+1]h) \left[ -k W_0 + W_1 \right] \tag{18.3.11} $$

Here is a routine wwghts that uses the above formulas to return an extended N-point
quadrature rule for the interval (a, b) = (0, [N - 1]h). Input to wwghts is a user-supplied
routine, kermom, that is called to get the first four indefinite-integral moments of w(x), namely

$$ F_m(y) \equiv \int^{y} s^m w(s)\,ds \qquad m = 0, 1, 2, 3 \tag{18.3.12} $$

(The lower limit is arbitrary and can be chosen for convenience.) Cautionary note: When
called with N < 4, wwghts returns a rule of lower order than Simpson; you should structure
your problem to avoid this.

void wwghts(float wghts[], int n, float h,
	void (*kermom)(double [], double ,int))
/* Constructs in wghts[1..n] weights for the n-point equal-interval quadrature from 0 to (n-1)h
of a function f(x) times an arbitrary (possibly singular) weight function w(x) whose indefinite-
integral moments F_n(y) are provided by the user-supplied routine kermom. */
{
	int j,k;
	double wold[5],wnew[5],w[5],hh,hi,c,fac,a,b;
	/* Double precision on internal calculations even though the interface is in
	   single precision. */

	hh=h;
	hi=1.0/hh;
	for (j=1;j<=n;j++) wghts[j]=0.0;	/* Zero all the weights so we can sum into them. */
	(*kermom)(wold,0.0,4);			/* Evaluate indefinite integrals at lower end. */
	if (n >= 4) {				/* Use highest available order. */
		b=0.0;				/* For another problem, you might change
						   this lower limit. */
		for (j=1;j<=n-3;j++) {
			c=j-1;			/* This is called k in equation (18.3.5). */
			a=b;			/* Set upper and lower limits for this step. */
			b=a+hh;
			if (j == n-3) b=(n-1)*hh;	/* Last interval: go all the way to end. */
			(*kermom)(wnew,b,4);
			for (fac=1.0,k=1;k<=4;k++,fac*=hi)	/* Equation (18.3.4). */
				w[k]=(wnew[k]-wold[k])*fac;
			wghts[j] += (			/* Equation (18.3.5). */
				((c+1.0)*(c+2.0)*(c+3.0)*w[1]
				-(11.0+c*(12.0+c*3.0))*w[2]
				+3.0*(c+2.0)*w[3]-w[4])/6.0);
			wghts[j+1] += (
				(-c*(c+2.0)*(c+3.0)*w[1]
				+(6.0+c*(10.0+c*3.0))*w[2]
				-(3.0*c+5.0)*w[3]+w[4])*0.5);
			wghts[j+2] += (
				(c*(c+1.0)*(c+3.0)*w[1]
				-(3.0+c*(8.0+c*3.0))*w[2]
				+(3.0*c+4.0)*w[3]-w[4])*0.5);
			wghts[j+3] += (
				(-c*(c+1.0)*(c+2.0)*w[1]
				+(2.0+c*(6.0+c*3.0))*w[2]
				-3.0*(c+1.0)*w[3]+w[4])/6.0);
			for (k=1;k<=4;k++) wold[k]=wnew[k];	/* Reset lower limits for moments. */
		}
	} else if (n == 3) {			/* Lower-order cases; not recommended. */
		(*kermom)(wnew,hh+hh,3);
		w[1]=wnew[1]-wold[1];
		w[2]=hi*(wnew[2]-wold[2]);
		w[3]=hi*hi*(wnew[3]-wold[3]);
		wghts[1]=w[1]-1.5*w[2]+0.5*w[3];
		wghts[2]=2.0*w[2]-w[3];
		wghts[3]=0.5*(w[3]-w[2]);
	} else if (n == 2) {
		(*kermom)(wnew,hh,2);
		wghts[1]=wnew[1]-wold[1]-(wghts[2]=hi*(wnew[2]-wold[2]));
	}
}


We will now give an example of how to apply wwghts to a singular integral equation.

Worked Example: A Diagonally Singular Kernel


As a particular example, consider the integral equation 

$$ f(x) + \int_0^{\pi} K(x,y) f(y)\,dy = \sin x \tag{18.3.13} $$

with the (arbitrarily chosen) nasty kernel

$$ K(x,y) = \cos x \cos y \times \begin{cases} \ln(x - y) & y < x \\ \sqrt{y - x} & y \ge x \end{cases} \tag{18.3.14} $$

which has a logarithmic singularity on the left of the diagonal, combined with a square-root
discontinuity on the right.

The first step is to do (analytically, in this case) the required moment integrals over 
the singular part of the kernel, equation (18.3.12). Since these integrals are done at a fixed 
value of x, we can use x as the lower limit. For any specified value of y, the required 
indefinite integral is then either 


$$ F_m(y; x) = \int_x^y s^m (s - x)^{1/2}\,ds = \int_0^{y-x} (x + t)^m t^{1/2}\,dt \qquad \text{if } y > x \tag{18.3.15} $$

or

$$ F_m(y; x) = \int_x^y s^m \ln(x - s)\,ds = \int_0^{x-y} (x - t)^m \ln t\,dt \qquad \text{if } y < x \tag{18.3.16} $$

(where a change of variable has been made in the second equality in each case). Doing these
integrals analytically (actually, we used a symbolic integration package!), we package the
resulting formulas in the following routine. Note that w(j + 1) returns $F_j(y; x)$.


#include <math.h>

extern double x;	/* Defined in quadmx. */

void kermom(double w[], double y, int m)
/* Returns in w[1..m] the first m indefinite-integral moments of one row of the singular part of
the kernel. (For this example, m is hard-wired to be 4.) The input variable y labels the column,
while the global variable x is the row. We can take x as the lower limit of integration. Thus,
we return the moment integrals either purely to the left or purely to the right of the diagonal. */
{
	double d,df,clog,x2,x3,x4,y2;

	if (y >= x) {
		d=y-x;
		df=2.0*sqrt(d)*d;
		w[1]=df/3.0;
		w[2]=df*(x/3.0+d/5.0);
		w[3]=df*((x/3.0 + 0.4*d)*x + d*d/7.0);
		w[4]=df*(((x/3.0 + 0.6*d)*x + 3.0*d*d/7.0)*x+d*d*d/9.0);
	} else {
		x3=(x2=x*x)*x;
		x4=x2*x2;
		y2=y*y;
		d=x-y;
		w[1]=d*((clog=log(d))-1.0);
		w[2] = -0.25*(3.0*x+y-2.0*clog*(x+y))*d;
		w[3]=(-11.0*x3+y*(6.0*x2+y*(3.0*x+2.0*y))
			+6.0*clog*(x3-y*y2))/18.0;
		w[4]=(-25.0*x4+y*(12.0*x3+y*(6.0*x2+y*
			(4.0*x+3.0*y)))+12.0*clog*(x4-(y2*y2)))/48.0;
	}
}

Next, we write a routine that constructs the quadrature matrix.


#include <math.h>
#include "nrutil.h"
#define PI 3.14159265

double x;	/* Communicates with kermom. */

void quadmx(float **a, int n)
/* Constructs in a[1..n][1..n] the quadrature matrix for an example Fredholm equation of
the second kind. The nonsingular part of the kernel is computed within this routine, while
the quadrature weights which integrate the singular part of the kernel are obtained via calls
to wwghts. An external routine kermom, which supplies indefinite-integral moments of the
singular part of the kernel, is passed to wwghts. */
{
	void kermom(double w[], double y, int m);
	void wwghts(float wghts[], int n, float h,
		void (*kermom)(double [], double ,int));
	int j,k;
	float h,*wt,xx,cx;

	wt=vector(1,n);
	h=PI/(n-1);
	for (j=1;j<=n;j++) {
		x=xx=(j-1)*h;		/* Put x in global variable for use by kermom. */
		wwghts(wt,n,h,kermom);
		cx=cos(xx);		/* Part of nonsingular kernel. */
		for (k=1;k<=n;k++) a[j][k]=wt[k]*cx*cos((k-1)*h);
					/* Put together all the pieces of the kernel. */
		++a[j][j];		/* Since equation of the second kind, there is
					   diagonal piece independent of h. */
	}
	free_vector(wt,1,n);
}


Finally, we solve the linear system for any particular right-hand side, here sin x. 


#include <stdio.h>
#include <math.h>
#include "nrutil.h"
#define PI 3.14159265
#define N 40	/* Here the size of the grid is specified. */

int main(void)	/* Program fredex */
/* This sample program shows how to solve a Fredholm equation of the second kind using the
product Nystrom method and a quadrature rule especially constructed for a particular, singular,
kernel. */
{
	void lubksb(float **a, int n, int *indx, float b[]);
	void ludcmp(float **a, int n, int *indx, float *d);
	void quadmx(float **a, int n);
	float **a,d,*g,x;
	int *indx,j;

	indx=ivector(1,N);
	a=matrix(1,N,1,N);
	g=vector(1,N);
	quadmx(a,N);		/* Make the quadrature matrix; all the action is here. */
	ludcmp(a,N,indx,&d);	/* Decompose the matrix. */
	for (j=1;j<=N;j++) g[j]=sin((j-1)*PI/(N-1));
				/* Construct the right hand side, here sin x. */
	lubksb(a,N,indx,g);	/* Backsubstitute. */
	for (j=1;j<=N;j++) {	/* Write out the solution. */
		x=(j-1)*PI/(N-1);
		printf("%6.2d %12.6f %12.6f\n",j,x,g[j]);
	}
	free_vector(g,1,N);
	free_matrix(a,1,N,1,N);
	free_ivector(indx,1,N);
	return 0;
}

Figure 18.3.1. Solution of the example integral equation (18.3.14) with grid sizes N = 10, 20, and 40.
The tabulated solution values have been connected by straight lines; in practice one would interpolate
a small N solution more smoothly.


With N = 40, this program gives accuracy at about the $10^{-5}$ level. The accuracy
increases as $N^4$ (as it should for our Simpson-order quadrature scheme) despite the highly
singular kernel. Figure 18.3.1 shows the solution obtained, also plotting the solution for
smaller values of N, which are themselves seen to be remarkably faithful. Notice that the
solution is smooth, even though the kernel is singular, a common occurrence.


CITED REFERENCES AND FURTHER READING: 

Abramowitz, M., and Stegun, I.A. 1964, Handbook of Mathematical Functions, Applied
Mathematics Series, Volume 55 (Washington: National Bureau of Standards; reprinted 1968 by
Dover Publications, New York). [1]

Stroud, A.H., and Secrest, D. 1966, Gaussian Quadrature Formulas (Englewood Cliffs, NJ:
Prentice-Hall). [2]

Delves, L.M., and Mohamed, J.L. 1985, Computational Methods for Integral Equations
(Cambridge, U.K.: Cambridge University Press). [3]

Atkinson, K.E. 1976, A Survey of Numerical Methods for the Solution of Fredholm Integral
Equations of the Second Kind (Philadelphia: S.I.A.M.). [4]







18.4 Inverse Problems and the Use of A Priori Information


Later discussion will be facilitated by some preliminary mention of a couple
of mathematical points. Suppose that u is an “unknown” vector that we plan to
determine by some minimization principle. Let $\mathcal{A}[\mathbf{u}] > 0$ and $\mathcal{B}[\mathbf{u}] > 0$ be two
positive functionals of u, so that we can try to determine u by either

$$ \text{minimize: } \mathcal{A}[\mathbf{u}] \qquad \text{or} \qquad \text{minimize: } \mathcal{B}[\mathbf{u}] \tag{18.4.1} $$

(Of course these will generally give different answers for u.) As another possibility,
now suppose that we want to minimize $\mathcal{A}[\mathbf{u}]$ subject to the constraint that $\mathcal{B}[\mathbf{u}]$ have
some particular value, say b. The method of Lagrange multipliers gives the variation

$$ \frac{\delta}{\delta \mathbf{u}} \left\{ \mathcal{A}[\mathbf{u}] + \lambda_1 (\mathcal{B}[\mathbf{u}] - b) \right\} = \frac{\delta}{\delta \mathbf{u}} \left( \mathcal{A}[\mathbf{u}] + \lambda_1 \mathcal{B}[\mathbf{u}] \right) = 0 \tag{18.4.2} $$

where $\lambda_1$ is a Lagrange multiplier. Notice that b is absent in the second equality,
since it doesn’t depend on u.

Next, suppose that we change our minds and decide to minimize $\mathcal{B}[\mathbf{u}]$ subject
to the constraint that $\mathcal{A}[\mathbf{u}]$ have a particular value, a. Instead of equation (18.4.2)
we have

$$ \frac{\delta}{\delta \mathbf{u}} \left\{ \mathcal{B}[\mathbf{u}] + \lambda_2 (\mathcal{A}[\mathbf{u}] - a) \right\} = \frac{\delta}{\delta \mathbf{u}} \left( \mathcal{B}[\mathbf{u}] + \lambda_2 \mathcal{A}[\mathbf{u}] \right) = 0 \tag{18.4.3} $$

with, this time, $\lambda_2$ the Lagrange multiplier. Multiplying equation (18.4.3) by the
constant $1/\lambda_2$, and identifying $1/\lambda_2$ with $\lambda_1$, we see that the actual variations are
exactly the same in the two cases. Both cases will yield the same one-parameter
family of solutions, say, $\mathbf{u}(\lambda_1)$. As $\lambda_1$ varies from 0 to ∞, the solution $\mathbf{u}(\lambda_1)$
varies along a so-called trade-off curve between the problem of minimizing $\mathcal{A}$ and
the problem of minimizing $\mathcal{B}$. Any solution along this curve can equally well
be thought of as either (i) a minimization of $\mathcal{A}$ for some constrained value of $\mathcal{B}$,
or (ii) a minimization of $\mathcal{B}$ for some constrained value of $\mathcal{A}$, or (iii) a weighted
minimization of the sum $\mathcal{A} + \lambda_1 \mathcal{B}$.

The second preliminary point has to do with degenerate minimization principles.
In the example above, now suppose that $\mathcal{A}[u]$ has the particular form

$$\mathcal{A}[u] = |\mathbf{A}\cdot\mathbf{u} - \mathbf{c}|^2 \tag{18.4.4}$$

for some matrix $\mathbf{A}$ and vector $\mathbf{c}$. If $\mathbf{A}$ has fewer rows than columns, or if $\mathbf{A}$ is square
but degenerate (has a nontrivial nullspace, see §2.6, especially Figure 2.6.1), then
minimizing $\mathcal{A}[u]$ will not give a unique solution for u. (To see why, review §15.4,
and note that for a “design matrix” $\mathbf{A}$ with fewer rows than columns, the matrix
$\mathbf{A}^T\cdot\mathbf{A}$ in the normal equations 15.4.10 is degenerate.) However, if we add any
positive multiple $\lambda$ times a nondegenerate quadratic form $\mathcal{B}[u]$, for example $\mathbf{u}\cdot\mathbf{H}\cdot\mathbf{u}$ with $\mathbf{H}$
a positive definite matrix, then minimization of $\mathcal{A}[u] + \lambda\mathcal{B}[u]$ will lead to a unique
solution for u. (The sum of two quadratic forms is itself a quadratic form, with the
second piece guaranteeing nondegeneracy.)
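
A minimal sketch of this point, under assumptions of our own choosing: one measurement of two unknowns (a design matrix with the single row (1, 1) and c = 2), with the identity matrix standing in for H. The helper names `det2`, `solve2` and `regularized_solution` are hypothetical.

```python
def det2(m):
    """Determinant of a 2x2 matrix given as a list of rows."""
    return m[0][0] * m[1][1] - m[0][1] * m[1][0]

def solve2(m, rhs):
    """Solve a 2x2 system by Cramer's rule; fails if the matrix is singular."""
    d = det2(m)
    if d == 0.0:
        raise ZeroDivisionError("singular matrix: no unique solution")
    return [(rhs[0] * m[1][1] - m[0][1] * rhs[1]) / d,
            (m[0][0] * rhs[1] - rhs[0] * m[1][0]) / d]

# One measurement of two unknowns: |A.u - c|^2 with A = [1 1], c = 2.
ata = [[1.0, 1.0], [1.0, 1.0]]   # A^T . A, which is degenerate
atc = [2.0, 2.0]                 # A^T . c

def regularized_solution(lam):
    """Minimize |A.u - c|^2 + lam * u.u (H taken to be the identity)."""
    m = [[ata[0][0] + lam, ata[0][1]], [ata[1][0], ata[1][1] + lam]]
    return solve2(m, atc)

print(det2(ata))                   # zero: the unregularized problem is degenerate
print(regularized_solution(0.5))   # a unique solution once lam > 0
```

With lam = 0 the call raises, since every u with u1 + u2 = 2 fits equally well; any positive lam picks out the single solution u1 = u2 = 2/(2 + lam).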

We can combine these two points, for this conclusion: When a quadratic
minimization principle is combined with a quadratic constraint, and both are positive, 
only one of the two need be nondegenerate for the overall problem to be well-posed. 
We are now equipped to face the subject of inverse problems. 

The Inverse Problem with Zeroth-Order Regularization 

Suppose that u(x) is some unknown or underlying (u stands for both unknown
and underlying!) physical process, which we hope to determine by a set of N
measurements $c_i$, i = 1, 2, ..., N. The relation between u(x) and the $c_i$’s is that
each $c_i$ measures a (hopefully distinct) aspect of u(x) through its own linear response
kernel $r_i$, and with its own measurement error $n_i$. In other words,

$$c_i \equiv s_i + n_i = \int r_i(x)\,u(x)\,dx + n_i \tag{18.4.5}$$

(compare this to equations 13.3.1 and 13.3.2). Within the assumption of linearity,
this is quite a general formulation. The $c_i$’s might approximate values of u(x) at
certain locations $x_i$, in which case $r_i(x)$ would have the form of a more or less
narrow instrumental response centered around $x = x_i$. Or, the $c_i$’s might “live” in an
entirely different function space from u(x), measuring different Fourier components
of u(x) for example.

The inverse problem is, given the $c_i$’s, the $r_i(x)$’s, and perhaps some information
about the errors $n_i$ such as their covariance matrix

$$S_{ij} \equiv \mathrm{Covar}[n_i, n_j] \tag{18.4.6}$$

how do we find a good statistical estimator of u(x), call it $\hat{u}(x)$?

It should be obvious that this is an ill-posed problem. After all, how can we
reconstruct a whole function u(x) from only a finite number of discrete values $c_i$?
Yet, whether formally or informally, we do this all the time in science. We routinely
measure “enough points” and then “draw a curve through them.” In doing so, we
are making some assumptions, either about the underlying function u(x), or about
the nature of the response functions $r_i(x)$, or both. Our purpose now is to formalize
these assumptions, and to extend our abilities to cases where the measurements and
underlying function live in quite different function spaces. (How do you “draw a
curve” through a scattering of Fourier coefficients?)

We can’t really want every point x of the function $\hat{u}(x)$. We do want some
large number M of discrete points $x_\mu$, $\mu$ = 1, 2, ..., M, where M is sufficiently
large, and the $x_\mu$’s are sufficiently evenly spaced, that neither u(x) nor $r_i(x)$ varies
much between any $x_\mu$ and $x_{\mu+1}$. (Here and following we will use Greek letters like
$\mu$ to denote values in the space of the underlying process, and Roman letters like i
to denote values of immediate observables.) For such a dense set of $x_\mu$’s, we can
replace equation (18.4.5) by a quadrature like

$$c_i = \sum_\mu R_{i\mu}\,u(x_\mu) + n_i \tag{18.4.7}$$

where the N x M matrix $\mathbf{R}$ has components

$$R_{i\mu} \equiv r_i(x_\mu)\,(x_{\mu+1} - x_{\mu-1})/2 \tag{18.4.8}$$

(or any other simple quadrature; it rarely matters which). We will view equations
(18.4.5) and (18.4.7) as being equivalent for practical purposes.
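
The quadrature matrix of equation (18.4.8) can be sketched as follows. This is a minimal illustration, not a library routine: the Gaussian response kernels, the grid, and the names `response_matrix` and `gaussian_kernel` are assumptions made for the example, as is the one-sided treatment of the two end points, which the text leaves open.

```python
import math

def response_matrix(kernels, xs):
    """R[i][mu] = r_i(x_mu) * (x_{mu+1} - x_{mu-1}) / 2, as in (18.4.8).

    At the two end points the missing neighbor is replaced by the end
    point itself (a half interval); this end treatment is an assumption.
    """
    m = len(xs)
    weights = []
    for mu in range(m):
        left = xs[mu - 1] if mu > 0 else xs[mu]
        right = xs[mu + 1] if mu < m - 1 else xs[mu]
        weights.append(0.5 * (right - left))
    return [[r(x) * w for x, w in zip(xs, weights)] for r in kernels]

def gaussian_kernel(center, width):
    """Hypothetical unit-area instrumental response centered on `center`."""
    norm = 1.0 / (width * math.sqrt(2.0 * math.pi))
    return lambda x: norm * math.exp(-0.5 * ((x - center) / width) ** 2)

xs = [mu / 200.0 for mu in range(201)]                         # M = 201 points on [0, 1]
kernels = [gaussian_kernel(c, 0.05) for c in (0.3, 0.5, 0.7)]  # N = 3 measurements
R = response_matrix(kernels, xs)

# Noise-free "measurements" of u(x) = 1: each row sum approximates the
# integral of r_i, which is close to 1 for these narrow kernels.
c = [sum(row) for row in R]
print(c)
```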

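How do you solve a set of equations like equation (18.4.7) for the unknown
$\hat{u}(x_\mu)$’s? Here is a bad way, but one that contains the germ of some correct ideas:
Form a $\chi^2$ measure of how well a model $\hat{u}(x)$ agrees with the measured data,

$$\chi^2 = \sum_{i=1}^{N}\sum_{j=1}^{N}\left[c_i - \sum_{\mu=1}^{M} R_{i\mu}\hat{u}(x_\mu)\right] S^{-1}_{ij} \left[c_j - \sum_{\mu=1}^{M} R_{j\mu}\hat{u}(x_\mu)\right] \approx \sum_{i=1}^{N}\left[\frac{c_i - \sum_{\mu=1}^{M} R_{i\mu}\hat{u}(x_\mu)}{\sigma_i}\right]^2 \tag{18.4.9}$$

(compare with equation 15.1.5). Here $\mathbf{S}^{-1}$ is the inverse of the covariance matrix,
and the approximate equality holds if you can neglect the off-diagonal covariances,
with $\sigma_i \equiv (\mathrm{Covar}[i,i])^{1/2}$.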

Now you can use the method of singular value decomposition (SVD) in §15.4
to find the vector $\hat{\mathbf{u}}$ that minimizes equation (18.4.9). Don’t try to use the method
of normal equations; since M is greater than N they will be singular, as we already
discussed. The SVD process will thus surely find a large number of zero singular
values, indicative of a highly non-unique solution. Among the infinity of degenerate
solutions (most of them badly behaved with arbitrarily large $\hat{u}(x_\mu)$’s) SVD will
select the one with smallest $|\hat{\mathbf{u}}|$ in the sense of

$$\sum_\mu [\hat{u}(x_\mu)]^2 \quad \text{a minimum} \tag{18.4.10}$$

(look at Figure 2.6.1). This solution is often called the principal solution. It
is a limiting case of what is called zeroth-order regularization, corresponding to
minimizing the sum of the two positive functionals

$$\text{minimize: } \chi^2[\hat{\mathbf{u}}] + \lambda(\hat{\mathbf{u}}\cdot\hat{\mathbf{u}}) \tag{18.4.11}$$

in the limit of small $\lambda$. Below, we will learn how to do such minimizations, as well
as more general ones, without the ad hoc use of SVD.

What happens if we determine $\hat{\mathbf{u}}$ by equation (18.4.11) with a non-infinitesimal
value of $\lambda$? First, note that if $M \gg N$ (many more unknowns than equations), then
$\hat{\mathbf{u}}$ will often have enough freedom to be able to make $\chi^2$ (equation 18.4.9) quite
unrealistically small, if not zero. In the language of §15.1, the number of degrees of
freedom $\nu = N - M$, which is approximately the expected value of $\chi^2$ when $\nu$ is
large, is being driven down to zero (and, not meaningfully, beyond). Yet, we know
that for the true underlying function u(x), which has no adjustable parameters, the
number of degrees of freedom and the expected value of $\chi^2$ should be about $\nu \approx N$.

Increasing $\lambda$ pulls the solution away from minimizing $\chi^2$ in favor of minimizing
$\hat{\mathbf{u}}\cdot\hat{\mathbf{u}}$. From the preliminary discussion above, we can view this as minimizing $\hat{\mathbf{u}}\cdot\hat{\mathbf{u}}$
subject to the constraint that $\chi^2$ have some constant nonzero value. A popular
choice, in fact, is to find that value of $\lambda$ which yields $\chi^2 = N$, that is, to get about as
much extra regularization as a plausible value of $\chi^2$ dictates. The resulting $\hat{u}(x)$ is
called the solution of the inverse problem with zeroth-order regularization.

Figure 18.4.1. Almost all inverse problem methods involve a trade-off between two optimizations:
agreement between data and solution, or “sharpness” of mapping between true and estimated solution (here
denoted $\mathcal{A}$), and smoothness or stability of the solution (here denoted $\mathcal{B}$). Among all possible solutions,
shown here schematically as the shaded region, those on the boundary connecting the unconstrained
minimum of $\mathcal{A}$ and the unconstrained minimum of $\mathcal{B}$ are the “best” solutions, in the sense that every
other solution is dominated by at least one solution on the curve.

The value N is actually a surrogate for any value drawn from a Gaussian
distribution with mean N and standard deviation $(2N)^{1/2}$ (the asymptotic $\chi^2$
distribution). One might equally plausibly try two values of $\lambda$, one giving $\chi^2 = N + (2N)^{1/2}$,
the other $N - (2N)^{1/2}$.

Zeroth-order regularization, though dominated by better methods, demonstrates
most of the basic ideas that are used in inverse problem theory. In general, there are
two positive functionals, call them $\mathcal{A}$ and $\mathcal{B}$. The first, $\mathcal{A}$, measures something like
the agreement of a model to the data (e.g., $\chi^2$), or sometimes a related quantity like
the “sharpness” of the mapping between the solution and the underlying function.
When $\mathcal{A}$ by itself is minimized, the agreement or sharpness becomes very good
(often impossibly good), but the solution becomes unstable, wildly oscillating, or in
other ways unrealistic, reflecting that $\mathcal{A}$ alone typically defines a highly degenerate
minimization problem.

That is where B comes in. It measures something like the “smoothness” of the 
desired solution, or sometimes a related quantity that parametrizes the stability of 
the solution with respect to variations in the data, or sometimes a quantity reflecting 
a priori judgments about the likelihood of a solution. B is called the stabilizing 
functional or regularizing operator. In any case, minimizing B by itself is supposed 
to give a solution that is “smooth” or “stable” or “likely” — and that has nothing 
at all to do with the measured data. 



The single central idea in inverse theory is the prescription

$$\text{minimize: } \mathcal{A} + \lambda\mathcal{B} \tag{18.4.12}$$

for various values of $0 < \lambda < \infty$ along the so-called trade-off curve (see Figure
18.4.1), and then to settle on a “best” value of $\lambda$ by one or another criterion, ranging
from fairly objective (e.g., making $\chi^2 = N$) to entirely subjective. Successful
methods, several of which we will now describe, differ as to their choices of $\mathcal{A}$ and
$\mathcal{B}$, as to whether the prescription (18.4.12) yields linear or nonlinear equations, as
to their recommended method for selecting a final $\lambda$, and as to their practicality for
computer-intensive two-dimensional problems like image processing.

They also differ as to the philosophical baggage that they (or rather, their 
proponents) carry. We have thus far avoided the word “Bayesian.” (Courts have 
consistently held that academic license does not extend to shouting “Bayesian” in a 
crowded lecture hall.) But it is hard, nor have we any wish, to disguise the fact that 
$\mathcal{B}$ has something to do with a priori expectation, or knowledge, of a solution, while
$\mathcal{A}$ has something to do with a posteriori knowledge. The constant $\lambda$ adjudicates a
delicate compromise between the two. Some inverse methods have acquired a more
Bayesian stamp than others, but we think that this is purely an accident of history. 
An outsider looking only at the equations that are actually solved, and not at the 
accompanying philosophical justifications, would have a difficult time separating the 
so-called Bayesian methods from the so-called empirical ones, we think. 

The next three sections discuss three different approaches to the problem of 
inversion, which have had considerable success in different fields. All three fit 
within the general framework that we have outlined, but they are quite different in 
detail and in implementation. 

CITED REFERENCES AND FURTHER READING: 

Craig, I.J.D., and Brown, J.C. 1986, Inverse Problems in Astronomy (Bristol, U.K.: Adam Hilger). 
Twomey, S. 1977, Introduction to the Mathematics of Inversion in Remote Sensing and Indirect 
Measurements (Amsterdam: Elsevier). 

Tikhonov, A.N., and Arsenin, V.Y. 1977, Solutions of Ill-Posed Problems (New York: Wiley). 
Tikhonov, A.N., and Goncharsky, A.V. (eds.) 1987, Ill-Posed Problems in the Natural Sciences 
(Moscow: MIR). 

Parker, R.L. 1977, Annual Review of Earth and Planetary Science, vol. 5, pp. 35-64. 

Frieden, B.R. 1975, in Picture Processing and Digital Filtering, T.S. Huang, ed. (New York: 
Springer-Verlag). 

Tarantola, A. 1987, Inverse Problem Theory (Amsterdam: Elsevier).

Baumeister, J. 1987, Stable Solution of Inverse Problems (Braunschweig, Germany: Friedr. Vieweg 
& Sohn) [mathematically oriented]. 

Titterington, D.M. 1985, Astronomy and Astrophysics, vol. 144, pp. 381-387. 

Jeffrey, W., and Rosner, R. 1986, Astrophysical Journal, vol. 310, pp. 463-472. 


18.5 Linear Regularization Methods 

What we will call linear regularization is also called the Phillips-Twomey
method [1,2], the constrained linear inversion method [3], the method of regularization
[4], and Tikhonov-Miller regularization [5-7]. (It probably has other names also,
since it is so obviously a good idea.) In its simplest form, the method is an immediate
generalization of zeroth-order regularization (equation 18.4.11, above). As before,
the functional $\mathcal{A}$ is taken to be the $\chi^2$ deviation, equation (18.4.9), but the functional
$\mathcal{B}$ is replaced by more sophisticated measures of smoothness that derive from first
or higher derivatives.

For example, suppose that your a priori belief is that a credible u(x) is not too
different from a constant. Then a reasonable functional to minimize is

$$\mathcal{B} \propto \int [\hat{u}'(x)]^2\,dx \propto \sum_{\mu=1}^{M-1} [\hat{u}_\mu - \hat{u}_{\mu+1}]^2 \tag{18.5.1}$$

since it is nonnegative and equal to zero only when $\hat{u}(x)$ is constant. Here
$\hat{u}_\mu \equiv \hat{u}(x_\mu)$, and the second equality (proportionality) assumes that the $x_\mu$’s are
uniformly spaced. We can write the second form of $\mathcal{B}$ as

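$$\mathcal{B} = |\mathbf{B}\cdot\hat{\mathbf{u}}|^2 = \hat{\mathbf{u}}\cdot(\mathbf{B}^T\cdot\mathbf{B})\cdot\hat{\mathbf{u}} \equiv \hat{\mathbf{u}}\cdot\mathbf{H}\cdot\hat{\mathbf{u}} \tag{18.5.2}$$

where $\hat{\mathbf{u}}$ is the vector of components $\hat{u}_\mu$, $\mu$ = 1, ..., M, $\mathbf{B}$ is the (M - 1) x M
first difference matrix

$$\mathbf{B} = \begin{pmatrix}
-1 & 1 & 0 & 0 & 0 & 0 & 0 & \cdots & 0 \\
0 & -1 & 1 & 0 & 0 & 0 & 0 & \cdots & 0 \\
\vdots & & & & \ddots & & & & \vdots \\
0 & \cdots & 0 & 0 & 0 & 0 & -1 & 1 & 0 \\
0 & \cdots & 0 & 0 & 0 & 0 & 0 & -1 & 1
\end{pmatrix} \tag{18.5.3}$$

and $\mathbf{H}$ is the M x M matrix

$$\mathbf{H} = \mathbf{B}^T\cdot\mathbf{B} = \begin{pmatrix}
1 & -1 & 0 & 0 & 0 & 0 & 0 & \cdots & 0 \\
-1 & 2 & -1 & 0 & 0 & 0 & 0 & \cdots & 0 \\
0 & -1 & 2 & -1 & 0 & 0 & 0 & \cdots & 0 \\
\vdots & & & & \ddots & & & & \vdots \\
0 & \cdots & 0 & 0 & 0 & -1 & 2 & -1 & 0 \\
0 & \cdots & 0 & 0 & 0 & 0 & -1 & 2 & -1 \\
0 & \cdots & 0 & 0 & 0 & 0 & 0 & -1 & 1
\end{pmatrix} \tag{18.5.4}$$

Note that $\mathbf{B}$ has one fewer row than column. It follows that the symmetric $\mathbf{H}$
is degenerate; it has exactly one zero eigenvalue corresponding to the value of a
constant function, any one of which makes $\mathcal{B}$ exactly zero.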
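
The structure of these two matrices is easy to check numerically. The following is a minimal sketch for M = 5, using plain lists of rows; the helper names `first_difference` and `gram` are hypothetical.

```python
def first_difference(m):
    """(m-1) x m matrix with rows (..., -1, 1, ...), as in equation (18.5.3)."""
    b = [[0.0] * m for _ in range(m - 1)]
    for row in range(m - 1):
        b[row][row] = -1.0
        b[row][row + 1] = 1.0
    return b

def gram(b):
    """H = B^T . B for a dense list-of-rows matrix."""
    rows, cols = len(b), len(b[0])
    return [[sum(b[k][i] * b[k][j] for k in range(rows)) for j in range(cols)]
            for i in range(cols)]

B = first_difference(5)
H = gram(B)
for row in H:
    print(row)

# A constant vector lies in the nullspace of H: H . (1, 1, ..., 1) = 0.
constant = [1.0] * 5
H_times_constant = [sum(h * u for h, u in zip(row, constant)) for row in H]
print(H_times_constant)
```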

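If, just as in §15.4, we write

$$A_{i\mu} \equiv R_{i\mu}/\sigma_i \qquad b_i \equiv c_i/\sigma_i \tag{18.5.5}$$

then, using equation (18.4.9), the minimization principle (18.4.12) is

$$\text{minimize: } \mathcal{A} + \lambda\mathcal{B} = |\mathbf{A}\cdot\hat{\mathbf{u}} - \mathbf{b}|^2 + \lambda\,\hat{\mathbf{u}}\cdot\mathbf{H}\cdot\hat{\mathbf{u}} \tag{18.5.6}$$

This can readily be reduced to a linear set of normal equations, just as in §15.4: The
components $\hat{u}_\mu$ of the solution satisfy the set of M equations in M unknowns,

$$\sum_\rho \left[\left(\sum_i A_{i\mu}A_{i\rho}\right) + \lambda H_{\mu\rho}\right]\hat{u}_\rho = \sum_i A_{i\mu}b_i \qquad \mu = 1, 2, \ldots, M \tag{18.5.7}$$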
or, in vector notation,

$$(\mathbf{A}^T\cdot\mathbf{A} + \lambda\mathbf{H})\cdot\hat{\mathbf{u}} = \mathbf{A}^T\cdot\mathbf{b} \tag{18.5.8}$$

Equations (18.5.7) or (18.5.8) can be solved by the standard techniques of
Chapter 2, e.g., LU decomposition. The usual warnings about normal equations
being ill-conditioned do not apply, since the whole purpose of the $\lambda$ term is to cure
that same ill-conditioning. Note, however, that the $\lambda$ term by itself is ill-conditioned,
since it does not select a preferred constant value. You hope your data can at
least do that!
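
As a minimal sketch of equation (18.5.8), the code below assembles and solves the regularized normal equations for a tiny made-up problem: N = 2 measurements, each the sum of two adjacent unknowns, with M = 4 and the first-difference H. The data values, the choice lam = 0.1, and the plain Gaussian-elimination helper (a stand-in for the LU routines of Chapter 2) are all assumptions for illustration.

```python
def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting (stand-in for LU)."""
    n = len(rhs)
    a = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for k in range(col, n + 1):
                a[r][k] -= f * a[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(a[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (a[r][n] - s) / a[r][r]
    return x

def regularized_solve(a, b, h, lam):
    """Solve (A^T.A + lam*H).u = A^T.b, equation (18.5.8)."""
    n, m = len(a), len(a[0])
    lhs = [[sum(a[i][p] * a[i][q] for i in range(n)) + lam * h[p][q]
            for q in range(m)] for p in range(m)]
    rhs = [sum(a[i][p] * b[i] for i in range(n)) for p in range(m)]
    return solve(lhs, rhs)

# N = 2 measurements of M = 4 unknowns; H from the first-difference operator.
A = [[1.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 1.0]]
b = [2.0, 6.0]
H = [[1.0, -1.0, 0.0, 0.0],
     [-1.0, 2.0, -1.0, 0.0],
     [0.0, -1.0, 2.0, -1.0],
     [0.0, 0.0, -1.0, 1.0]]
u = regularized_solve(A, b, H, lam=0.1)
print(u)
```

Neither matrix alone determines u here: the data term cannot split each pair of unknowns, and H cannot fix the overall level. Their sum is nondegenerate, and the result is a smooth ramp through the two measured sums.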

Although inversion of the matrix $(\mathbf{A}^T\cdot\mathbf{A} + \lambda\mathbf{H})$ is not generally the best way to
solve for $\hat{\mathbf{u}}$, let us digress to write the solution to equation (18.5.8) schematically as

$$\hat{\mathbf{u}} = \left(\frac{1}{\mathbf{A}^T\cdot\mathbf{A} + \lambda\mathbf{H}}\cdot\mathbf{A}^T\cdot\mathbf{A}\right)\mathbf{A}^{-1}\cdot\mathbf{b} \qquad \text{(schematic only!)} \tag{18.5.9}$$

where the identity matrix in the form $\mathbf{A}\cdot\mathbf{A}^{-1}$ has been inserted. This is schematic
not only because the matrix inverse is fancifully written as a denominator, but
also because, in general, the inverse matrix $\mathbf{A}^{-1}$ does not exist. However, it is
illuminating to compare equation (18.5.9) with equation (13.3.6) for optimal or
Wiener filtering, or with equation (13.6.6) for general linear prediction. One sees
that $\mathbf{A}^T\cdot\mathbf{A}$ plays the role of $S^2$, the signal power or autocorrelation, while $\lambda\mathbf{H}$
plays the role of $N^2$, the noise power or autocorrelation. The term in parentheses
in equation (18.5.9) is something like an optimal filter, whose effect is to pass the
ill-posed inverse $\mathbf{A}^{-1}\cdot\mathbf{b}$ through unmodified when $\mathbf{A}^T\cdot\mathbf{A}$ is sufficiently large, but
to suppress it when $\mathbf{A}^T\cdot\mathbf{A}$ is small.

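The above choices of $\mathbf{B}$ and $\mathbf{H}$ are only the simplest in an obvious sequence of
derivatives. If your a priori belief is that a linear function is a good approximation
to u(x), then minimize

$$\mathcal{B} \propto \int [\hat{u}''(x)]^2\,dx \propto \sum_{\mu=1}^{M-2} [-\hat{u}_\mu + 2\hat{u}_{\mu+1} - \hat{u}_{\mu+2}]^2 \tag{18.5.10}$$

implying the (M - 2) x M second difference matrix

$$\mathbf{B} = \begin{pmatrix}
-1 & 2 & -1 & 0 & 0 & 0 & 0 & \cdots & 0 \\
0 & -1 & 2 & -1 & 0 & 0 & 0 & \cdots & 0 \\
\vdots & & & & \ddots & & & & \vdots \\
0 & \cdots & 0 & 0 & 0 & -1 & 2 & -1 & 0 \\
0 & \cdots & 0 & 0 & 0 & 0 & -1 & 2 & -1
\end{pmatrix} \tag{18.5.11}$$

and

$$\mathbf{H} = \mathbf{B}^T\cdot\mathbf{B} = \begin{pmatrix}
1 & -2 & 1 & 0 & 0 & 0 & 0 & \cdots & 0 \\
-2 & 5 & -4 & 1 & 0 & 0 & 0 & \cdots & 0 \\
1 & -4 & 6 & -4 & 1 & 0 & 0 & \cdots & 0 \\
0 & 1 & -4 & 6 & -4 & 1 & 0 & \cdots & 0 \\
\vdots & & & & \ddots & & & & \vdots \\
0 & \cdots & 0 & 1 & -4 & 6 & -4 & 1 & 0 \\
0 & \cdots & 0 & 0 & 1 & -4 & 6 & -4 & 1 \\
0 & \cdots & 0 & 0 & 0 & 1 & -4 & 5 & -2 \\
0 & \cdots & 0 & 0 & 0 & 0 & 1 & -2 & 1
\end{pmatrix} \tag{18.5.12}$$

This $\mathbf{H}$ has two zero eigenvalues, corresponding to the two undetermined parameters
of a linear function.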

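If your a priori belief is that a quadratic function is preferable, then minimize

$$\mathcal{B} \propto \int [\hat{u}'''(x)]^2\,dx \propto \sum_{\mu=1}^{M-3} [-\hat{u}_\mu + 3\hat{u}_{\mu+1} - 3\hat{u}_{\mu+2} + \hat{u}_{\mu+3}]^2 \tag{18.5.13}$$

with

$$\mathbf{B} = \begin{pmatrix}
-1 & 3 & -3 & 1 & 0 & 0 & 0 & \cdots & 0 \\
0 & -1 & 3 & -3 & 1 & 0 & 0 & \cdots & 0 \\
\vdots & & & & \ddots & & & & \vdots \\
0 & \cdots & 0 & 0 & -1 & 3 & -3 & 1 & 0 \\
0 & \cdots & 0 & 0 & 0 & -1 & 3 & -3 & 1
\end{pmatrix} \tag{18.5.14}$$

and now

$$\mathbf{H} = \mathbf{B}^T\cdot\mathbf{B} = \begin{pmatrix}
1 & -3 & 3 & -1 & 0 & 0 & 0 & 0 & \cdots & 0 \\
-3 & 10 & -12 & 6 & -1 & 0 & 0 & 0 & \cdots & 0 \\
3 & -12 & 19 & -15 & 6 & -1 & 0 & 0 & \cdots & 0 \\
-1 & 6 & -15 & 20 & -15 & 6 & -1 & 0 & \cdots & 0 \\
\vdots & & & & & \ddots & & & & \vdots \\
0 & \cdots & 0 & -1 & 6 & -15 & 20 & -15 & 6 & -1 \\
0 & \cdots & 0 & 0 & -1 & 6 & -15 & 19 & -12 & 3 \\
0 & \cdots & 0 & 0 & 0 & -1 & 6 & -12 & 10 & -3 \\
0 & \cdots & 0 & 0 & 0 & 0 & -1 & 3 & -3 & 1
\end{pmatrix} \tag{18.5.15}$$

(We’ll leave the calculation of cubics and above to the compulsive reader.)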
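
For readers who would rather not be compulsive by hand, the rows of the difference matrix of any order follow from binomial coefficients. The sketch below is one way to generate them; `difference_matrix` and `matvec` are hypothetical helper names, and the overall sign of each row is chosen to match the rows (-1, 1), (-1, 2, -1) and (-1, 3, -3, 1) used above.

```python
from math import comb

def difference_matrix(m, order):
    """(m - order) x m finite-difference matrix of the given order.

    Stencil entry j is (-1)^(j+1) * C(order, j), which reproduces the rows
    (-1, 1), (-1, 2, -1) and (-1, 3, -3, 1) for orders 1, 2 and 3.
    """
    stencil = [(-1) ** (j + 1) * comb(order, j) for j in range(order + 1)]
    b = [[0.0] * m for _ in range(m - order)]
    for row in range(m - order):
        for j, coeff in enumerate(stencil):
            b[row][row + j] = float(coeff)
    return b

def matvec(b, u):
    """Matrix-vector product for a dense list-of-rows matrix."""
    return [sum(bij * uj for bij, uj in zip(row, u)) for row in b]

M = 8
line = [3.0 + 2.0 * mu for mu in range(M)]                 # linear in mu
parabola = [1.0 - mu + 0.5 * mu * mu for mu in range(M)]   # quadratic in mu

print(matvec(difference_matrix(M, 2), line))       # zeros: a line is unpenalized
print(matvec(difference_matrix(M, 3), parabola))   # zeros: a parabola is unpenalized
print(matvec(difference_matrix(M, 2), parabola))   # nonzero: order 2 penalizes curvature
```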

Notice that you can regularize with “closeness to a differential equation,” if
you want. Just pick $\mathbf{B}$ to be the appropriate sum of finite-difference operators (the
coefficients can depend on x), and calculate $\mathbf{H} = \mathbf{B}^T\cdot\mathbf{B}$. You don’t need to know
the values of your boundary conditions, since $\mathbf{B}$ can have fewer rows than columns,
as above; hopefully, your data will determine them. Of course, if you do know some
boundary conditions, you can build these into $\mathbf{B}$ too.

With all the proportionality signs above, you may have lost track of what actual
value of $\lambda$ to try first. A simple trick for at least getting “on the map” is to first try

$$\lambda = \mathrm{Tr}(\mathbf{A}^T\cdot\mathbf{A})/\mathrm{Tr}(\mathbf{H}) \tag{18.5.16}$$

where Tr is the trace of the matrix (sum of diagonal components). This choice
will tend to make the two parts of the minimization have comparable weights, and
you can adjust from there.

As for what is the “correct” value of $\lambda$, an objective criterion, if you know
your errors $\sigma_i$ with reasonable accuracy, is to make $\chi^2$ (that is, $|\mathbf{A}\cdot\hat{\mathbf{u}} - \mathbf{b}|^2$) equal
to N, the number of measurements. We remarked above on the twin acceptable
choices $N \pm (2N)^{1/2}$. A subjective criterion is to pick any value that you like in the
range $0 < \lambda < \infty$, depending on your relative degree of belief in the a priori and a
posteriori evidence. (Yes, people actually do that. Don’t blame us.)
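
The objective criterion can be automated. The sketch below starts from the trace ratio of equation (18.5.16) and bisects on the logarithm of lambda until chi-square equals N. It relies on chi-square being a nondecreasing function of lambda along the trade-off curve, and on the target being bracketed within eight decades either side of the starting value. The tiny test problem, the bracket width, and all function names are assumptions made for illustration.

```python
def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting (stand-in for LU)."""
    n = len(rhs)
    a = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for k in range(col, n + 1):
                a[r][k] -= f * a[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (a[r][n] - sum(a[r][k] * x[k] for k in range(r + 1, n))) / a[r][r]
    return x

def regularized_solve(a, b, h, lam):
    """Solve (A^T.A + lam*H).u = A^T.b, equation (18.5.8)."""
    n, m = len(a), len(a[0])
    lhs = [[sum(a[i][p] * a[i][q] for i in range(n)) + lam * h[p][q]
            for q in range(m)] for p in range(m)]
    rhs = [sum(a[i][p] * b[i] for i in range(n)) for p in range(m)]
    return solve(lhs, rhs)

def chi_square(a, b, u):
    """|A.u - b|^2, with A and b already scaled by the errors (18.5.5)."""
    return sum((sum(x * y for x, y in zip(row, u)) - bi) ** 2
               for row, bi in zip(a, b))

def trace_ratio(a, h):
    """Starting guess lambda = Tr(A^T.A) / Tr(H), equation (18.5.16)."""
    return sum(x * x for row in a for x in row) / sum(h[p][p] for p in range(len(h)))

def lambda_for_chi_square(a, b, h, target, decades=8, steps=80):
    """Bisect on log(lambda) until chi-square is close to `target`."""
    lam0 = trace_ratio(a, h)
    lo, hi = lam0 * 10.0 ** -decades, lam0 * 10.0 ** decades
    for _ in range(steps):
        mid = (lo * hi) ** 0.5                      # geometric midpoint
        if chi_square(a, b, regularized_solve(a, b, h, mid)) < target:
            lo = mid
        else:
            hi = mid
    return (lo * hi) ** 0.5

# N = 3 measurements (sums of adjacent pairs) of M = 6 unknowns.
A = [[1.0, 1.0, 0.0, 0.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 0.0, 0.0, 1.0, 1.0]]
b = [2.0, 10.0, 4.0]
M = 6
H = [[0.0] * M for _ in range(M)]          # first-difference H = B^T.B
for mu in range(M - 1):
    H[mu][mu] += 1.0
    H[mu + 1][mu + 1] += 1.0
    H[mu][mu + 1] -= 1.0
    H[mu + 1][mu] -= 1.0

lam = lambda_for_chi_square(A, b, H, target=float(len(b)))
chi2 = chi_square(A, b, regularized_solve(A, b, H, lam))
print(lam, chi2)
```

The same routine, called with targets N + sqrt(2N) and N - sqrt(2N), gives the two bracketing values of lambda mentioned above, provided each target is attainable.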

Two-Dimensional Problems and Iterative Methods 


Up to now our notation has been indicative of a one-dimensional problem,
finding $\hat{u}(x)$ or $\hat{u}_\mu = \hat{u}(x_\mu)$. However, all of the discussion easily generalizes to the
problem of estimating a two-dimensional set of unknowns $\hat{u}_{\mu\kappa}$, $\mu$ = 1, ..., M, $\kappa$ =
1, ..., K, corresponding, say, to the pixel intensities of a measured image. In this
case, equation (18.5.8) is still the one we want to solve.

In image processing, it is usual to have the same number of input pixels in a
measured “raw” or “dirty” image as desired “clean” pixels in the processed output
image, so the matrices $\mathbf{R}$ and $\mathbf{A}$ (equation 18.5.5) are square and of size MK x MK.
$\mathbf{A}$ is typically much too large to represent as a full matrix, but often it is either (i)
sparse, with coefficients blurring an underlying pixel (i, j) only into measurements
(i ± few, j ± few), or (ii) translationally invariant, so that $A_{(i,j)(\mu,\nu)} = A(i - \mu, j - \nu)$.
Both of these situations lead to tractable problems.

In the case of translational invariance, fast Fourier transforms (FFTs) are the
obvious method of choice. The general linear relation between underlying function
and measured values (18.4.7) now becomes a discrete convolution like equation
(13.1.1). If $\mathbf{k}$ denotes a two-dimensional wave-vector, then the two-dimensional FFT
takes us back and forth between the transform pairs

$$A(i - \mu,\, j - \nu) \Longleftrightarrow \tilde{A}(\mathbf{k}) \qquad b_{(i,j)} \Longleftrightarrow \tilde{b}(\mathbf{k}) \qquad \hat{u}_{(i,j)} \Longleftrightarrow \tilde{u}(\mathbf{k}) \tag{18.5.17}$$

We also need a regularization or smoothing operator $\mathbf{B}$ and the derived $\mathbf{H} = \mathbf{B}^T\cdot\mathbf{B}$.
One popular choice for $\mathbf{B}$ is the five-point finite-difference approximation of the
Laplacian operator, that is, the difference between the value of each point and the
average of its four Cartesian neighbors. In Fourier space, this choice implies

$$\tilde{B}(\mathbf{k}) \propto \sin^2(\pi k_1/M)\,\sin^2(\pi k_2/K) \qquad \tilde{H}(\mathbf{k}) \propto \sin^4(\pi k_1/M)\,\sin^4(\pi k_2/K) \tag{18.5.18}$$

In Fourier space, equation (18.5.7) is merely algebraic, with solution

$$\tilde{u}(\mathbf{k}) = \frac{\tilde{A}^*(\mathbf{k})\,\tilde{b}(\mathbf{k})}{|\tilde{A}(\mathbf{k})|^2 + \lambda\tilde{H}(\mathbf{k})} \tag{18.5.19}$$

where asterisk denotes complex conjugation. You can make use of the FFT routines
for real data in §12.5.
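
A one-dimensional analog shows how little work the Fourier-space solution involves. The sketch below is an illustration under stated assumptions, not the two-dimensional procedure itself: it uses a periodic one-dimensional blur, a naive O(n^2) transform in place of an FFT, and sin^4(pi k / n) as the one-dimensional counterpart of the smoothing term. The kernel, the data, and all function names are made up for the example.

```python
import cmath
import math

def dft(values, sign=-1):
    """Naive discrete Fourier transform; sign=+1 gives the unnormalized inverse."""
    n = len(values)
    return [sum(v * cmath.exp(sign * 2j * math.pi * t * k / n)
                for t, v in enumerate(values)) for k in range(n)]

def circular_convolve(kernel, u):
    """b_i = sum over mu of kernel[(i - mu) mod n] * u_mu (periodic blur)."""
    n = len(u)
    return [sum(kernel[(i - mu) % n] * u[mu] for mu in range(n)) for i in range(n)]

def fourier_regularized_solve(kernel, b, lam):
    """Divide mode by mode: u(k) = conj(A(k)) b(k) / (|A(k)|^2 + lam*H(k))."""
    n = len(b)
    a_k, b_k = dft(kernel), dft(b)
    u_k = []
    for k in range(n):
        h_k = math.sin(math.pi * k / n) ** 4     # assumed 1-D smoothing term
        u_k.append(a_k[k].conjugate() * b_k[k] / (abs(a_k[k]) ** 2 + lam * h_k))
    return [v.real / n for v in dft(u_k, sign=+1)]

kernel = [0.6, 0.2, 0.0, 0.0, 0.0, 0.0, 0.0, 0.2]   # symmetric blur, never zero in k
u_true = [0.0, 0.0, 1.0, 3.0, 1.0, 0.0, 0.0, 0.0]
b = circular_convolve(kernel, u_true)

u_exact = fourier_regularized_solve(kernel, b, lam=0.0)    # plain deconvolution
u_smooth = fourier_regularized_solve(kernel, b, lam=0.05)  # regularized
print(u_exact)
print(u_smooth)
```

With noise-free data and a kernel whose transform has no zeros, lam = 0 recovers the input. A positive lam damps the high-wavenumber modes but leaves the k = 0 mode, and hence the total, untouched.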

Turn now to the case where $\mathbf{A}$ is not translationally invariant. Direct solution
of (18.5.8) is now hopeless, since the matrix $\mathbf{A}$ is just too large. We need some
kind of iterative scheme.

One way to proceed is to use the full machinery of the conjugate gradient
method in §10.6 to find the minimum of $\mathcal{A} + \lambda\mathcal{B}$, equation (18.5.6). Of the various
methods in Chapter 10, conjugate gradient is the unique best choice because (i)
it does not require storage of a Hessian matrix, which would be infeasible here,
and (ii) it does exploit gradient information, which we can readily compute: The
gradient of equation (18.5.6) is

$$\nabla(\mathcal{A} + \lambda\mathcal{B}) = 2[(\mathbf{A}^T\cdot\mathbf{A} + \lambda\mathbf{H})\cdot\hat{\mathbf{u}} - \mathbf{A}^T\cdot\mathbf{b}] \tag{18.5.20}$$

(cf. 18.5.8). Evaluation of both the function and the gradient should of course take
advantage of the sparsity of $\mathbf{A}$, for example via the routines sprsax and sprstx
in §2.7. We will discuss the conjugate gradient technique further in §18.7, in the
context of the (nonlinear) maximum entropy method. Some of that discussion can
apply here as well.

The conjugate gradient method notwithstanding, application of the unsophisticated
steepest descent method (see §10.6) can sometimes produce useful results,
particularly when combined with projections onto convex sets (see below). If the
solution after k iterations is denoted $\hat{\mathbf{u}}^{(k)}$, then after k + 1 iterations we have

$$\hat{\mathbf{u}}^{(k+1)} = [\mathbf{1} - \epsilon(\mathbf{A}^T\cdot\mathbf{A} + \lambda\mathbf{H})]\cdot\hat{\mathbf{u}}^{(k)} + \epsilon\mathbf{A}^T\cdot\mathbf{b} \tag{18.5.21}$$

Here $\epsilon$ is a parameter that dictates how far to move in the downhill gradient direction.
The method converges when $\epsilon$ is small enough, in particular satisfying

$$0 < \epsilon < \frac{2}{\text{max eigenvalue}\,(\mathbf{A}^T\cdot\mathbf{A} + \lambda\mathbf{H})} \tag{18.5.22}$$

There exist complicated schemes for finding optimal values or sequences for $\epsilon$,
see [7]; or, one can adopt an experimental approach, evaluating (18.5.6) to be sure
that downhill steps are in fact being taken.
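
A minimal sketch of iteration (18.5.21) follows, on a made-up problem with N = 2 and M = 4. One assumption deserves flagging: rather than computing the largest eigenvalue, the code bounds it by the largest absolute row sum of the symmetric matrix, which is always at least as large, so the resulting step satisfies condition (18.5.22). The function names and data are hypothetical, and the matrix is formed densely only because the example is tiny.

```python
def matvec(m, v):
    """Matrix-vector product for a dense list-of-rows matrix."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def normal_matrix(a, h, lam):
    """Dense A^T.A + lam*H (fine for a toy problem, infeasible for images)."""
    n, m = len(a), len(a[0])
    return [[sum(a[i][p] * a[i][q] for i in range(n)) + lam * h[p][q]
             for q in range(m)] for p in range(m)]

def steepest_descent(a, b, h, lam, iterations):
    """Iterate u <- u - eps * [(A^T.A + lam*H).u - A^T.b], equation (18.5.21)."""
    m = len(a[0])
    big = normal_matrix(a, h, lam)
    atb = [sum(a[i][p] * b[i] for i in range(len(a))) for p in range(m)]
    # The largest eigenvalue of a symmetric matrix cannot exceed its largest
    # absolute row sum, so eps = 1 / bound satisfies condition (18.5.22).
    bound = max(sum(abs(x) for x in row) for row in big)
    eps = 1.0 / bound
    u = [0.0] * m
    for _ in range(iterations):
        g = [bu - t for bu, t in zip(matvec(big, u), atb)]
        u = [ui - eps * gi for ui, gi in zip(u, g)]
    return u

A = [[1.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 1.0]]
b = [2.0, 6.0]
H = [[1.0, -1.0, 0.0, 0.0],
     [-1.0, 2.0, -1.0, 0.0],
     [0.0, -1.0, 2.0, -1.0],
     [0.0, 0.0, -1.0, 1.0]]
u = steepest_descent(A, b, H, lam=0.1, iterations=500)
print(u)
```

For this small, well-conditioned example a few hundred iterations reproduce the direct solution of (18.5.8); realistic problems converge far more slowly.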

In those image processing problems where the final measure of success is
somewhat subjective (e.g., “how good does the picture look?”), iteration (18.5.21)
sometimes produces significantly improved images long before convergence is
achieved. This probably accounts for much of its use, since its mathematical
convergence is extremely slow. In fact, (18.5.21) can be used with $\mathbf{H} = 0$, in which
case the solution is not regularized at all, and full convergence would be disastrous!
This is called Van Cittert’s method and goes back to the 1930s. A number of
iterations of the order of 1000 is not uncommon [7].


Deterministic Constraints: Projections onto Convex Sets 

A set of possible underlying functions (or images) $\{\hat{\mathbf{u}}\}$ is said to be convex if,
for any two elements $\hat{\mathbf{u}}_a$ and $\hat{\mathbf{u}}_b$ in the set, all the linearly interpolated combinations

$$(1 - \eta)\,\hat{\mathbf{u}}_a + \eta\,\hat{\mathbf{u}}_b \qquad 0 \le \eta \le 1 \tag{18.5.23}$$

are also in the set. Many deterministic constraints that one might want to impose on
the solution $\hat{\mathbf{u}}$ to an inverse problem in fact define convex sets, for example:

• positivity

• compact support (i.e., zero value outside of a certain region)

• known bounds (i.e., $u_L(x) \le \hat{u}(x) \le u_U(x)$ for specified functions $u_L$
and $u_U$).

(In this last case, the bounds might be related to an initial estimate and its error bars,
e.g., $\hat{u}_0(x) \pm \gamma\sigma(x)$, where $\gamma$ is of order 1 or 2.) Notice that these, and similar,
constraints can be either in the image space, or in the Fourier transform space, or (in
fact) in the space of any linear transformation of $\hat{\mathbf{u}}$.

If $C_i$ is a convex set, then $\mathcal{P}_i$ is called a nonexpansive projection operator onto
that set if (i) $\mathcal{P}_i$ leaves unchanged any $\hat{\mathbf{u}}$ already in $C_i$, and (ii) $\mathcal{P}_i$ maps any $\hat{\mathbf{u}}$ outside
$C_i$ to the closest element of $C_i$, in the sense that

$$|\mathcal{P}_i\hat{\mathbf{u}} - \hat{\mathbf{u}}| \le |\hat{\mathbf{u}}_a - \hat{\mathbf{u}}| \qquad \text{for all } \hat{\mathbf{u}}_a \text{ in } C_i \tag{18.5.24}$$

While this definition sounds complicated, examples are very simple: A nonexpansive
projection onto the set of positive $\hat{\mathbf{u}}$’s is “set all negative components of $\hat{\mathbf{u}}$ equal
to zero.” A nonexpansive projection onto the set of $\hat{u}(x)$’s bounded by $u_L(x) \le \hat{u}(x) \le u_U(x)$
is “set all values less than the lower bound equal to that bound, and
set all values greater than the upper bound equal to that bound.” A nonexpansive
projection onto functions with compact support is “zero the values outside of the
region of support.”
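
The three projections just described take one line each. The sketch below applies them to a discretized solution stored as a list; the function names and the sample vector are hypothetical.

```python
def project_positive(u):
    """Set all negative components to zero."""
    return [max(0.0, ui) for ui in u]

def project_bounds(u, lower, upper):
    """Clip each component to its [lower, upper] interval."""
    return [min(max(ui, lo), hi) for ui, lo, hi in zip(u, lower, upper)]

def project_support(u, support):
    """Zero every component whose index lies outside the support set."""
    return [ui if mu in support else 0.0 for mu, ui in enumerate(u)]

u = [-1.0, 0.5, 2.0, -0.2, 3.0]
print(project_positive(u))                        # [0.0, 0.5, 2.0, 0.0, 3.0]
print(project_bounds(u, [0.0] * 5, [1.0] * 5))    # [0.0, 0.5, 1.0, 0.0, 1.0]
print(project_support(u, {1, 2, 3}))              # [0.0, 0.5, 2.0, -0.2, 0.0]
```

Each operator leaves a vector that already satisfies its constraint unchanged, which is property (i) of the definition.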

The usefulness of these definitions is the following remarkable theorem: Let C
be the intersection of m convex sets $C_1, C_2, \ldots, C_m$. Then the iteration

$$\hat{\mathbf{u}}^{(k+1)} = (\mathcal{P}_1\mathcal{P}_2\cdots\mathcal{P}_m)\,\hat{\mathbf{u}}^{(k)} \tag{18.5.25}$$

will converge to C from all starting points, as $k \to \infty$. Also, if C is empty (there
is no intersection), then the iteration will have no limit point. Application of this
theorem is called the method of projections onto convex sets or sometimes POCS [7].
A generalization of the POCS theorem is that the $\mathcal{P}_i$’s can be replaced by
a set of $\mathcal{T}_i$’s,

$$\mathcal{T}_i \equiv \mathbf{1} + \beta_i(\mathcal{P}_i - \mathbf{1}) \qquad 0 < \beta_i < 2 \tag{18.5.26}$$

A well-chosen set of $\beta_i$’s can accelerate the convergence to the intersection set C.
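
The POCS iteration and its relaxed variant can be sketched in two dimensions. The two convex sets used here, a line and a box, are chosen purely for illustration, as are the starting point, the value 1.5 for both relaxation parameters, and all function names. The example makes no claim about which choice converges faster; it only shows both reaching the intersection.

```python
def project_hyperplane(u, a, d):
    """Closest point to u on the hyperplane a.u = d."""
    shift = (sum(ai * ui for ai, ui in zip(a, u)) - d) / sum(ai * ai for ai in a)
    return [ui - shift * ai for ui, ai in zip(u, a)]

def project_box(u, lower, upper):
    """Closest point to u in the box lower <= u <= upper."""
    return [min(max(ui, lo), hi) for ui, lo, hi in zip(u, lower, upper)]

def relaxed(projection, beta):
    """T = 1 + beta * (P - 1), equation (18.5.26); beta = 1 gives P itself."""
    def operator(u):
        return [ui + beta * (pi - ui) for ui, pi in zip(u, projection(u))]
    return operator

def pocs(operators, u, iterations):
    """Apply u <- (T_1 T_2 ... T_m) u repeatedly; the rightmost acts first."""
    for _ in range(iterations):
        for op in reversed(operators):
            u = op(u)
    return u

def plane(u):
    return project_hyperplane(u, [1.0, 1.0], 2.0)      # set C_1: u1 + u2 = 2

def box(u):
    return project_box(u, [0.0, 0.0], [0.5, 3.0])      # set C_2: a box

start = [5.0, -4.0]
u_plain = pocs([plane, box], start, 200)
u_relax = pocs([relaxed(plane, 1.5), relaxed(box, 1.5)], start, 200)
print(u_plain, u_relax)
```

The two runs end at different points of the intersection, which is consistent with the theorem: it promises convergence to the set C, not to a particular element of it.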

Some inverse problems can be completely solved by iteration (18.5.25) alone! 
For example, a problem that occurs in both astronomical imaging and X-ray 
diffraction work is to recover an image given only the modulus of its Fourier 
transform (equivalent to its power spectrum or autocorrelation) and not the phase. 
Here three convex sets can be utilized: the set of all images whose Fourier transform 
has the specified modulus to within specified error bounds; the set of all positive 
images; and the set of all images with zero intensity outside of some specified region. 
In this case the POCS iteration (18.5.25) cycles among these three, imposing each
constraint in turn; FFTs are used to get in and out of Fourier space each time the
Fourier constraint is imposed.

The specific application of POCS to constraints alternately in the spatial and
Fourier domains is also known as the Gerchberg-Saxton algorithm [8]. While this
algorithm is non-expansive, and is frequently convergent in practice, it has not been
proved to converge in all cases [9]. In the phase-retrieval problem mentioned above,
the algorithm often “gets stuck” on a plateau for many iterations before making
sudden, dramatic improvements. As many as $10^4$ to $10^5$ iterations are sometimes
necessary. (For “unsticking” procedures, see [10].) The uniqueness of the solution
is also not well understood, although for two-dimensional images of reasonable
complexity it is believed to be unique.

Deterministic constraints can be incorporated, via projection operators, into
iterative methods of linear regularization. In particular, rearranging terms somewhat,
we can write the iteration (18.5.21) as

$$\hat{\mathbf{u}}^{(k+1)} = (\mathbf{1} - \epsilon\lambda\mathbf{H})\cdot\hat{\mathbf{u}}^{(k)} + \epsilon\mathbf{A}^T\cdot(\mathbf{b} - \mathbf{A}\cdot\hat{\mathbf{u}}^{(k)}) \tag{18.5.27}$$

If the iteration is modified by the insertion of projection operators at each step

$$\hat{\mathbf{u}}^{(k+1)} = (\mathcal{P}_1\mathcal{P}_2\cdots\mathcal{P}_m)\,[(\mathbf{1} - \epsilon\lambda\mathbf{H})\cdot\hat{\mathbf{u}}^{(k)} + \epsilon\mathbf{A}^T\cdot(\mathbf{b} - \mathbf{A}\cdot\hat{\mathbf{u}}^{(k)})] \tag{18.5.28}$$

(or, instead of $\mathcal{P}_i$’s, the $\mathcal{T}_i$ operators of equation 18.5.26), then it can be shown that
the convergence condition (18.5.22) is unmodified, and the iteration will converge
to minimize the quadratic functional (18.5.6) subject to the desired nonlinear
deterministic constraints. See [7] for references to more sophisticated, and faster
converging, iterations along these lines.
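
A minimal sketch of iteration (18.5.28) with a single projection, positivity, is given below. The problem is made up so that the unconstrained solution would go negative in its first component: N = 2, M = 4, data (0, 6), first-difference H, lam = 0.1. The step eps = 0.4 is an assumed value that satisfies (18.5.22) for this particular matrix, whose largest eigenvalue is below 2.2. All function names are hypothetical.

```python
def matvec(m, v):
    """Matrix-vector product for a dense list-of-rows matrix."""
    return [sum(mij * vj for mij, vj in zip(row, v)) for row in m]

def transpose_matvec(a, v):
    """A^T . v without forming the transpose."""
    return [sum(a[i][p] * v[i] for i in range(len(a))) for p in range(len(a[0]))]

def project_positive(u):
    """Nonexpansive projection onto the set of nonnegative vectors."""
    return [max(0.0, ui) for ui in u]

def projected_iteration(a, b, h, lam, eps, project, iterations):
    """u <- P[(1 - eps*lam*H).u + eps*A^T.(b - A.u)], equation (18.5.28)."""
    u = [0.0] * len(a[0])
    for _ in range(iterations):
        residual = [bi - ri for bi, ri in zip(b, matvec(a, u))]
        data_step = transpose_matvec(a, residual)
        smooth = matvec(h, u)
        u = project([ui - eps * lam * si + eps * di
                     for ui, si, di in zip(u, smooth, data_step)])
    return u

A = [[1.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 1.0, 1.0]]
b = [0.0, 6.0]
H = [[1.0, -1.0, 0.0, 0.0],
     [-1.0, 2.0, -1.0, 0.0],
     [0.0, -1.0, 2.0, -1.0],
     [0.0, 0.0, -1.0, 1.0]]
u = projected_iteration(A, b, H, lam=0.1, eps=0.4,
                        project=project_positive, iterations=1000)
print(u)
```

The first component is pinned at zero by the constraint, while the remaining components satisfy the unconstrained normal equations restricted to the free unknowns.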


18.6 Backus-Gilbert Method

The Backus-Gilbert method [1,2] (see, e.g., [3] or [4] for summaries) differs from
other regularization methods in the nature of its functionals $\mathcal{A}$ and $\mathcal{B}$. For $\mathcal{B}$, the
method seeks to maximize the stability of the solution $\hat{u}(x)$ rather than, in the first
instance, its smoothness. That is,

$$\mathcal{B} \equiv \mathrm{Var}[\hat{u}(x)] \tag{18.6.1}$$

is used as a measure of how much the solution $\hat{u}(x)$ varies as the data vary within
their measurement errors. Note that this variance is not the expected deviation of
$\hat{u}(x)$ from the true u(x) (that will be constrained by $\mathcal{A}$) but rather measures
the expected experiment-to-experiment scatter among estimates $\hat{u}(x)$ if the whole
experiment were to be repeated many times.

For $\mathcal{A}$ the Backus-Gilbert method looks at the relationship between the solution
$\hat{u}(x)$ and the true function u(x), and seeks to make the mapping between these as
close to the identity map as possible in the limit of error-free data. The method is
linear, so the relationship between $\hat{u}(x)$ and u(x) can be written as

$$\hat{u}(x) = \int \hat{\delta}(x, x')\,u(x')\,dx' \tag{18.6.2}$$

for some so-called resolution function or averaging kernel $\hat{\delta}(x, x')$. The Backus-Gilbert
method seeks to minimize the width or spread of $\hat{\delta}$ (that is, maximize the
resolving power). $\mathcal{A}$ is chosen to be some positive measure of the spread.

While Backus-Gilbert’s philosophy is thus rather different from that of Phillips-Twomey
and related methods, in practice the differences between the methods are
less than one might think. A stable solution is almost inevitably bound to be
smooth: The wild, unstable oscillations that result from an unregularized solution
are always exquisitely sensitive to small changes in the data. Likewise, making
$\hat{u}(x)$ close to u(x) inevitably will bring error-free data into agreement with the
model. Thus $\mathcal{A}$ and $\mathcal{B}$ play roles closely analogous to their corresponding roles
in the previous two sections.

The principal advantage of the Backus-Gilbert formulation is that it gives good
control over just those properties that it seeks to measure, namely stability and
resolving power. Moreover, in the Backus-Gilbert method, the choice of $\lambda$ (playing
its usual role of compromise between $\mathcal{A}$ and $\mathcal{B}$) is conventionally made, or at least
can easily be made, before any actual data are processed. One’s uneasiness at making
a post hoc, and therefore potentially subjectively biased, choice of $\lambda$ is thus removed.
Backus-Gilbert is often recommended as the method of choice for designing, and
predicting the performance of, experiments that require data inversion.

Let’s see how this all works. Starting with equation (18.4.5),

$$c_i \equiv s_i + n_i = \int r_i(x)\,u(x)\,dx + n_i \tag{18.6.3}$$

and building in linearity from the start, we seek a set of inverse response kernels
$q_i(x)$ such that

$$\hat{u}(x) = \sum_i q_i(x)\,c_i \tag{18.6.4}$$

is the desired estimator of u(x). It is useful to define the integrals of the response
kernels for each data point,

$$R_i \equiv \int r_i(x)\,dx \tag{18.6.5}$$

Substituting equation (18.6.4) into equation (18.6.3), and comparing with equation
(18.6.2), we see that

$$\hat{\delta}(x, x') = \sum_i q_i(x)\,r_i(x') \tag{18.6.6}$$

We can require this averaging kernel to have unit area at every x, giving

$$1 = \int \hat{\delta}(x, x')\,dx' = \sum_i q_i(x)\int r_i(x')\,dx' = \sum_i q_i(x)\,R_i \equiv \mathbf{q}(x)\cdot\mathbf{R} \tag{18.6.7}$$

where $\mathbf{q}(x)$ and $\mathbf{R}$ are each vectors of length N, the number of measurements.
Standard propagation of errors, and equation (18.6.1), give

$$\mathcal{B} = \mathrm{Var}[\hat{u}(x)] = \sum_i\sum_j q_i(x)\,S_{ij}\,q_j(x) = \mathbf{q}(x)\cdot\mathbf{S}\cdot\mathbf{q}(x) \tag{18.6.8}$$

where $S_{ij}$ is the covariance matrix (equation 18.4.6). If one can neglect off-diagonal
covariances (as when the errors on the $c_i$’s are independent), then $S_{ij} = \delta_{ij}\sigma_i^2$
is diagonal.

We now need to define a measure of the width or spread of $\hat{\delta}(x, x')$ at each
value of x. While many choices are possible, Backus and Gilbert choose the second
moment of its square. This measure becomes the functional $\mathcal{A}$,

$$\mathcal{A} \equiv w(x) = \int (x' - x)^2\,[\hat{\delta}(x, x')]^2\,dx' = \sum_i\sum_j q_i(x)\,W_{ij}(x)\,q_j(x) \equiv \mathbf{q}(x)\cdot\mathbf{W}(x)\cdot\mathbf{q}(x) \tag{18.6.9}$$

where we have here used equation (18.6.6) and defined the spread matrix $\mathbf{W}(x)$ by

$$W_{ij}(x) \equiv \int (x' - x)^2\,r_i(x')\,r_j(x')\,dx' \tag{18.6.10}$$

The functions $q_i(x)$ are now determined by the minimization principle

$$\text{minimize: } \mathcal{A} + \lambda\mathcal{B} = \mathbf{q}(x)\cdot[\mathbf{W}(x) + \lambda\mathbf{S}]\cdot\mathbf{q}(x) \tag{18.6.11}$$

subject to the constraint (18.6.7) that $\mathbf{q}(x)\cdot\mathbf{R} = 1$.
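The solution of equation (18.6.11) is

$$\mathbf{q}(x) = \frac{[\mathbf{W}(x) + \lambda\mathbf{S}]^{-1}\cdot\mathbf{R}}{\mathbf{R}\cdot[\mathbf{W}(x) + \lambda\mathbf{S}]^{-1}\cdot\mathbf{R}} \tag{18.6.12}$$

(Reference [4] gives an accessible proof.) For any particular data set $\mathbf{c}$ (set of
measurements $c_i$), the solution $\hat{u}(x)$ is thus

$$\hat{u}(x) = \frac{\mathbf{c}\cdot[\mathbf{W}(x) + \lambda\mathbf{S}]^{-1}\cdot\mathbf{R}}{\mathbf{R}\cdot[\mathbf{W}(x) + \lambda\mathbf{S}]^{-1}\cdot\mathbf{R}} \tag{18.6.13}$$

(Don’t let this notation mislead you into inverting the full matrix $\mathbf{W}(x) + \lambda\mathbf{S}$. You
only need to solve for some $\mathbf{y}$ the linear system $(\mathbf{W}(x) + \lambda\mathbf{S})\cdot\mathbf{y} = \mathbf{R}$, and then
substitute $\mathbf{y}$ into both the numerators and denominators of 18.6.12 or 18.6.13.)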
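
The recipe in the parenthetical remark can be sketched as follows. The integrals are approximated by a plain sum on a uniform grid, and the four Gaussian response kernels, the diagonal covariance, the value lam = 1, and every function name are assumptions made for the example. A simple Gaussian-elimination helper stands in for the Chapter 2 routines.

```python
import math

def solve(matrix, rhs):
    """Gaussian elimination with partial pivoting (stand-in for LU)."""
    n = len(rhs)
    a = [row[:] + [r] for row, r in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for k in range(col, n + 1):
                a[r][k] -= f * a[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (a[r][n] - sum(a[r][k] * x[k] for k in range(r + 1, n))) / a[r][r]
    return x

def backus_gilbert(x, kernels, grid, cov, data, lam):
    """Return (estimate, q) at the point x from (18.6.10), (18.6.12), (18.6.13)."""
    n = len(kernels)
    dx = grid[1] - grid[0]
    r = [[k(xp) for xp in grid] for k in kernels]           # r_i(x') on the grid
    big_r = [sum(row) * dx for row in r]                    # R_i, equation (18.6.5)
    w = [[sum((xp - x) ** 2 * ri * rj
              for xp, ri, rj in zip(grid, r[i], r[j])) * dx
          for j in range(n)] for i in range(n)]             # spread matrix W(x)
    lhs = [[w[i][j] + lam * cov[i][j] for j in range(n)] for i in range(n)]
    y = solve(lhs, big_r)                                   # (W + lam*S).y = R
    denom = sum(ri * yi for ri, yi in zip(big_r, y))
    q = [yi / denom for yi in y]
    estimate = sum(ci * yi for ci, yi in zip(data, y)) / denom
    return estimate, q

def gaussian_kernel(center, width=0.1):
    """Hypothetical response kernel centered on `center`."""
    return lambda xp: math.exp(-0.5 * ((xp - center) / width) ** 2)

grid = [t / 100.0 for t in range(101)]
kernels = [gaussian_kernel(c) for c in (0.2, 0.4, 0.6, 0.8)]
S = [[1e-4 if i == j else 0.0 for j in range(4)] for i in range(4)]
R = [sum(k(xp) for xp in grid) * 0.01 for k in kernels]
data = [3.0 * r_i for r_i in R]          # noise-free measurements of u(x) = 3

estimate, q = backus_gilbert(0.5, kernels, grid, S, data, lam=1.0)
print(estimate, q)
```

Because the averaging kernel is constrained to unit area, a constant underlying function is reproduced exactly from noise-free data, whatever the value of lam.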

Equations (18.6.12) and (18.6.13) have a completely different character from 
the linearly regularized solutions to (18.5.7) and (18.5.8). The vectors and matrices in 
(18.6.12) all have size N, the number of measurements. There is no discretization of 
the underlying variable $x$, so $M$ does not come into play at all. One solves a different
$N \times N$ set of linear equations for each desired value of $x$. By contrast, in (18.5.8),
one solves an $M \times M$ linear set, but only once. In general, the computational burden
of repeatedly solving linear systems makes the Backus-Gilbert method unsuitable 
for other than one-dimensional problems. 

How does one choose $\lambda$ within the Backus-Gilbert scheme? As already
mentioned, you can (in some cases should) make the choice before you see any actual
data. For a given trial value of $\lambda$, and for a sequence of $x$'s, use equation (18.6.12)
to calculate $\mathbf{q}(x)$; then use equation (18.6.6) to plot the resolution functions $\hat\delta(x,x')$
as a function of $x'$. These plots will exhibit the amplitude with which different
underlying values $x'$ contribute to the point $\hat u(x)$ of your estimate. For the same
value of $\lambda$, also plot the function $\sqrt{\mathrm{Var}[\hat u(x)]}$ using equation (18.6.8). (You need an
estimate of your measurement covariance matrix for this.)

As you change $\lambda$ you will see very explicitly the trade-off between resolution
and stability. Pick the value that meets your needs. You can even choose $\lambda$ to be a
function of $x$, $\lambda = \lambda(x)$, in equations (18.6.12) and (18.6.13), should you desire to
do so. (This is one benefit of solving a separate set of equations for each $x$.) For
the chosen value or values of $\lambda$, you now have a quantitative understanding of your
inverse solution procedure. This can prove invaluable if — once you are processing 
real data — you need to judge whether a particular feature, a spike or jump for 
example, is genuine, and/or is actually resolved. The Backus-Gilbert method has 
found particular success among geophysicists, who use it to obtain information about 
the structure of the Earth (e.g., density run with depth) from seismic travel time data. 

CITED REFERENCES AND FURTHER READING: 

Backus, G.E., and Gilbert, F. 1968, Geophysical Journal of the Royal Astronomical Society, vol. 16, pp. 169-205. [1]

Backus, G.E., and Gilbert, F. 1970, Philosophical Transactions of the Royal Society of London A, vol. 266, pp. 123-192. [2]

Parker, R.L. 1977, Annual Review of Earth and Planetary Science, vol. 5, pp. 35-64. [3]

Loredo, T.J., and Epstein, R.I. 1989, Astrophysical Journal, vol. 336, pp. 896-919. [4]


18.7 Maximum Entropy Image Restoration 

Above, we commented that the association of certain inversion methods
with Bayesian arguments is more historical accident than intellectual imperative.
Maximum entropy methods, so-called, are notorious in this regard; to summarize
these methods without some, at least introductory, Bayesian invocations would be
to serve a steak without the sizzle, or a sundae without the cherry. We should
also comment in passing that the connection between maximum entropy inversion
methods, considered here, and maximum entropy spectral estimation, discussed in
§13.7, is rather abstract. For practical purposes the two techniques, though both
named maximum entropy method or MEM, are unrelated.

Bayes’ Theorem, which follows from the standard axioms of probability, relates
the conditional probabilities of two events, say $A$ and $B$:

$$\mathrm{Prob}(A|B) = \mathrm{Prob}(A)\,\frac{\mathrm{Prob}(B|A)}{\mathrm{Prob}(B)} \tag{18.7.1}$$

Here $\mathrm{Prob}(A|B)$ is the probability of $A$ given that $B$ has occurred, and similarly for
$\mathrm{Prob}(B|A)$, while $\mathrm{Prob}(A)$ and $\mathrm{Prob}(B)$ are unconditional probabilities.

“Bayesians” (so-called) adopt a broader interpretation of probabilities than do
so-called “frequentists.” To a Bayesian, $\mathrm{Prob}(A|B)$ is a measure of the degree of
plausibility of $A$ (given $B$) on a scale ranging from zero to one. In this broader view,
$A$ and $B$ need not be repeatable events; they can be propositions or hypotheses.
The equations of probability theory then become a set of consistent rules for
conducting inference [1,2]. Since plausibility is itself always conditioned on some,
perhaps unarticulated, set of assumptions, all Bayesian probabilities are viewed as
conditional on some collective background information $I$.

Suppose $H$ is some hypothesis. Even before there exist any explicit data,
a Bayesian can assign to $H$ some degree of plausibility $\mathrm{Prob}(H|I)$, called the
“Bayesian prior.” Now, when some data $D_1$ comes along, Bayes’ theorem tells how
to reassess the plausibility of $H$,

$$\mathrm{Prob}(H|D_1 I) = \mathrm{Prob}(H|I)\,\frac{\mathrm{Prob}(D_1|HI)}{\mathrm{Prob}(D_1|I)} \tag{18.7.2}$$


The factor in the numerator on the right of equation (18.7.2) is calculable as the 
probability of a data set given the hypothesis (compare with “likelihood” in §15.1). 
The denominator, called the “prior predictive probability” of the data, is in this case 
merely a normalization constant which can be calculated by the requirement that 
the probability of all hypotheses should sum to unity. (In other Bayesian contexts, 
the prior predictive probabilities of two qualitatively different models can be used 
to assess their relative plausibility.) 

If some additional data $D_2$ comes along tomorrow, we can further refine our
estimate of $H$’s probability, as

$$\mathrm{Prob}(H|D_2 D_1 I) = \mathrm{Prob}(H|D_1 I)\,\frac{\mathrm{Prob}(D_2|H D_1 I)}{\mathrm{Prob}(D_2|D_1 I)} \tag{18.7.3}$$

Using the product rule for probabilities, $\mathrm{Prob}(AB|C) = \mathrm{Prob}(A|C)\,\mathrm{Prob}(B|AC)$,
we find that equations (18.7.2) and (18.7.3) imply

$$\mathrm{Prob}(H|D_2 D_1 I) = \mathrm{Prob}(H|I)\,\frac{\mathrm{Prob}(D_2 D_1|HI)}{\mathrm{Prob}(D_2 D_1|I)} \tag{18.7.4}$$

which shows that we would have gotten the same answer if all the data $D_1 D_2$
had been taken together.
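
The equivalence of sequential and all-at-once updating is easy to check numerically. The sketch below is a toy of our own devising, not part of the original text: three hypothetical hypotheses about a coin's bias, and two made-up batches of tosses `d1` and `d2`. It assumes that tosses are independent given the hypothesis, so that Prob(D2|H D1 I) reduces to Prob(D2|H I).

```python
def update(prior, likelihoods):
    """One application of Bayes' theorem over a discrete set of hypotheses."""
    unnormalized = [p * like for p, like in zip(prior, likelihoods)]
    evidence = sum(unnormalized)      # Prob(D|I), the normalization constant
    return [v / evidence for v in unnormalized]


biases = [0.2, 0.5, 0.8]              # hypotheses H: probability of heads
prior = [1.0 / 3.0] * 3               # Prob(H|I), taken flat here


def likelihood(data):
    """Prob(D|H I) for each hypothesis, for a string of 'H'/'T' tosses."""
    out = []
    for bias in biases:
        p = 1.0
        for toss in data:
            p *= bias if toss == "H" else 1.0 - bias
        out.append(p)
    return out


d1, d2 = "HHT", "HTHH"                # illustrative data sets D1 and D2
sequential = update(update(prior, likelihood(d1)), likelihood(d2))
batch = update(prior, likelihood(d1 + d2))
```

The lists `sequential` and `batch` agree to rounding error, which is the content of equation (18.7.4).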
From a Bayesian perspective, inverse problems are inference problems [3,4]. The
underlying parameter set $\mathbf{u}$ is a hypothesis whose probability, given the measured
data values $\mathbf{c}$ and the Bayesian prior $\mathrm{Prob}(\mathbf{u}|I)$, can be calculated. We might want
to report a single “best” inverse $\mathbf{u}$, the one that maximizes

$$\mathrm{Prob}(\mathbf{u}|\mathbf{c}I) = \mathrm{Prob}(\mathbf{c}|\mathbf{u}I)\,\frac{\mathrm{Prob}(\mathbf{u}|I)}{\mathrm{Prob}(\mathbf{c}|I)} \tag{18.7.5}$$

over all possible choices of $\mathbf{u}$. Bayesian analysis also admits the possibility of
reporting additional information that characterizes the region of possible $\mathbf{u}$’s with
high relative probability, the so-called “posterior bubble” in $\mathbf{u}$.

The calculation of the probability of the data $\mathbf{c}$, given the hypothesis $\mathbf{u}$, proceeds
exactly as in the maximum likelihood method. For Gaussian errors, e.g., it is given by

$$\mathrm{Prob}(\mathbf{c}|\mathbf{u}I) = \exp(-\tfrac{1}{2}\chi^2)\,\Delta u_1\,\Delta u_2 \cdots \Delta u_M \tag{18.7.6}$$

where $\chi^2$ is calculated from $\mathbf{u}$ and $\mathbf{c}$ using equation (18.4.9), and the $\Delta u_\mu$’s are
constant, small ranges of the components of $\mathbf{u}$ whose actual magnitude is irrelevant,
because they do not depend on $\mathbf{u}$ (compare equations 15.1.3 and 15.1.4).

In maximum likelihood estimation we, in effect, chose the prior $\mathrm{Prob}(\mathbf{u}|I)$ to
be constant. That was a luxury that we could afford when estimating a small number
of parameters from a large amount of data. Here, the number of “parameters”
(components of $\mathbf{u}$) is comparable to or larger than the number of measured values
(components of $\mathbf{c}$); we need to have a nontrivial prior, $\mathrm{Prob}(\mathbf{u}|I)$, to resolve the
degeneracy of the solution.

In maximum entropy image restoration, that is where entropy comes in. The
entropy of a physical system in some macroscopic state, usually denoted $S$, is the
logarithm of the number of microscopically distinct configurations that all have
the same macroscopic observables (i.e., consistent with the observed macroscopic
state). Actually, we will find it useful to denote the negative of the entropy, also
called the negentropy, by $H = -S$ (a notation that goes back to Boltzmann). In
situations where there is reason to believe that the a priori probabilities of the
microscopic configurations are all the same (these situations are called ergodic), then
the Bayesian prior $\mathrm{Prob}(\mathbf{u}|I)$ for a macroscopic state with entropy $S$ is proportional
to $\exp(S)$ or $\exp(-H)$.

MEM uses this concept to assign a prior probability to any given underlying
function $u$. For example [5-7], suppose that the measurement of luminance in each
pixel is quantized to (in some units) an integer value. Let

$$U = \sum_{\mu=1}^{M} u_\mu \tag{18.7.7}$$

be the total number of luminance quanta in the whole image. Then we can base our
“prior” on the notion that each luminance quantum has an equal a priori chance of
being in any pixel. (See [8] for a more abstract justification of this idea.) The number
of ways of getting a particular configuration $\mathbf{u}$ is

$$\frac{U!}{u_1!\,u_2!\cdots u_M!} \propto \exp\left[-\sum_\mu u_\mu \ln(u_\mu/U) + \frac{1}{2}\left(\ln U - \sum_\mu \ln u_\mu\right)\right] \tag{18.7.8}$$

Here the left side can be understood as the number of distinct orderings of all
the luminance quanta, divided by the numbers of equivalent reorderings within
each pixel, while the right side follows by Stirling’s approximation to the factorial
function. Taking the negative of the logarithm, and neglecting terms of order $\log U$
in the presence of terms of order $U$, we get the negentropy

$$H(\mathbf{u}) = \sum_{\mu=1}^{M} u_\mu \ln(u_\mu/U) \tag{18.7.9}$$
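
How good is the neglect of the logarithmic terms? A quick check, offered only as an illustration, compares the exact logarithm of the multinomial count with the negative of equation (18.7.9). The pixel counts in `u` are invented for the example, and `math.lgamma` is used to evaluate the log-factorials exactly.

```python
import math


def log_multiplicity(u):
    """Exact ln( U! / (u_1! u_2! ... u_M!) ), using ln n! = lgamma(n + 1)."""
    total = sum(u)
    return math.lgamma(total + 1) - sum(math.lgamma(k + 1) for k in u)


def negentropy(u):
    """H(u) = sum over pixels of u_mu ln(u_mu / U), equation (18.7.9)."""
    total = sum(u)
    return sum(k * math.log(k / total) for k in u if k > 0)


u = [400, 250, 150, 120, 80]          # hypothetical quanta per pixel, U = 1000
exact = log_multiplicity(u)
approx = -negentropy(u)               # leading Stirling term only
rel_err = abs(exact - approx) / exact
```

For these counts the two agree to about one percent; the discrepancy is the collection of logarithmic terms dropped in passing from (18.7.8) to (18.7.9), and it becomes relatively smaller as the counts grow.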


From equations (18.7.5), (18.7.6), and (18.7.9) we now seek to maximize

$$\mathrm{Prob}(\mathbf{u}|\mathbf{c}) \propto \exp\left[-\tfrac{1}{2}\chi^2\right] \exp[-H(\mathbf{u})] \tag{18.7.10}$$

or, equivalently,

$$\text{minimize:}\quad -\ln[\mathrm{Prob}(\mathbf{u}|\mathbf{c})] = \tfrac{1}{2}\chi^2[\mathbf{u}] + H(\mathbf{u}) = \tfrac{1}{2}\chi^2[\mathbf{u}] + \sum_{\mu=1}^{M} u_\mu \ln(u_\mu/U) \tag{18.7.11}$$

This ought to remind you of equation (18.4.11), or equation (18.5.6), or in fact any of
our previous minimization principles along the lines of $\mathcal{A} + \lambda\mathcal{B}$, where $\lambda\mathcal{B} = H(\mathbf{u})$
is a regularizing operator. Where is $\lambda$? We need to put it in for exactly the reason
discussed following equation (18.4.11): Degenerate inversions are likely to be able
to achieve unrealistically small values of $\chi^2$. We need an adjustable parameter to
bring $\chi^2$ into its expected narrow statistical range of $N \pm (2N)^{1/2}$. The discussion at
the beginning of §18.4 showed that it makes no difference which term we attach the
$\lambda$ to. For consistency in notation, we absorb a factor 2 into $\lambda$ and put it on the entropy
term. (Another way to see the necessity of an undetermined $\lambda$ factor is to note that it
is necessary if our minimization principle is to be invariant under changing the units
in which $\mathbf{u}$ is quantized, e.g., if an 8-bit analog-to-digital converter is replaced by a
12-bit one.) We can now also put “hats” back to indicate that this is the procedure
for obtaining our chosen statistical estimator:


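$$\text{minimize:}\quad \mathcal{A} + \lambda\mathcal{B} = \chi^2[\hat{\mathbf{u}}] + \lambda H(\hat{\mathbf{u}}) = \chi^2[\hat{\mathbf{u}}] + \lambda \sum_{\mu=1}^{M} \hat u_\mu \ln(\hat u_\mu) \tag{18.7.12}$$

(Formally, we might also add a second Lagrange multiplier $\lambda' U$, to constrain the
total intensity $U$ to be constant.)

It is not hard to see that the negentropy, $H(\hat{\mathbf{u}})$, is in fact a regularizing operator,
similar to $\hat{\mathbf{u}}\cdot\hat{\mathbf{u}}$ (equation 18.4.11) or $\hat{\mathbf{u}}\cdot\mathbf{H}\cdot\hat{\mathbf{u}}$ (equation 18.5.6). The following of
its properties are noteworthy: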

1. When $U$ is held constant, $H(\hat{\mathbf{u}})$ is minimized for $\hat u_\mu = U/M$ = constant, so it
smooths in the sense of trying to achieve a constant solution, similar to equation
(18.5.4). The fact that the constant solution is a minimum follows from the fact
that the second derivative of $u \ln u$ is positive.

2. Unlike equation (18.5.4), however, $H(\hat{\mathbf{u}})$ is local, in the sense that it does not
difference neighboring pixels. It simply sums some function $f$, here

$$f(u) = u \ln u \tag{18.7.13}$$

over all pixels; it is invariant, in fact, under a complete scrambling of the pixels
in an image. This form implies that $H(\hat{\mathbf{u}})$ is not seriously increased by the
occurrence of a small number of very bright pixels (point sources) embedded
in a low-intensity smooth background.

3. $H(\hat{\mathbf{u}})$ goes to infinite slope as any one pixel goes to zero. This causes it to
enforce positivity of the image, without the necessity of additional deterministic
constraints.

4. The biggest difference between $H(\hat{\mathbf{u}})$ and the other regularizing operators that
we have met is that $H(\hat{\mathbf{u}})$ is not a quadratic functional of $\hat{\mathbf{u}}$, so the equations
obtained by varying equation (18.7.12) are nonlinear. This fact is itself worthy
of some additional discussion.
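
Properties 1 and 2 in the list above can be seen directly in a small numerical experiment. The sketch below is illustrative only: the image size `M`, the total intensity `U`, and the random "bumpy" image are arbitrary choices of ours.

```python
import math
import random


def negentropy_f(u):
    """Sum of f(u) = u ln u over all pixels, as in (18.7.12) and (18.7.13)."""
    return sum(x * math.log(x) for x in u)


random.seed(0)
M, U = 16, 32.0
flat = [U / M] * M                                   # the constant image

raw = [random.random() + 0.01 for _ in range(M)]     # strictly positive pixels
scale = U / sum(raw)
bumpy = [x * scale for x in raw]                     # same total intensity U

shuffled = bumpy[:]
random.shuffle(shuffled)                             # scramble the pixels

h_flat = negentropy_f(flat)
h_bumpy = negentropy_f(bumpy)
h_shuffled = negentropy_f(shuffled)
```

At fixed total intensity the flat image has the smallest negentropy (property 1), and scrambling the pixels leaves the value unchanged up to rounding (property 2).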

Nonlinear equations are harder to solve than linear equations. For image 
processing, however, the large number of equations usually dictates an iterative 
solution procedure, even for linear equations, so the practical effect of the nonlinearity 
is somewhat mitigated. Below, we will summarize some of the methods that are 
successfully used for MEM inverse problems. 

For some problems, notably the problem in radio-astronomy of image recovery
from an incomplete set of Fourier coefficients, the superior performance of MEM
inversion can be, in part, traced to the nonlinearity of $H(\hat{\mathbf{u}})$. One way to see this [5]
is to consider the limit of perfect measurements $\sigma_i \to 0$. In this case the $\chi^2$ term in
the minimization principle (18.7.12) gets replaced by a set of constraints, each with
its own Lagrange multiplier, requiring agreement between model and data; that is,

$$\text{minimize:}\quad \sum_j \lambda_j \left[c_j - \sum_\mu R_{j\mu}\,\hat u_\mu\right] + H(\hat{\mathbf{u}}) \tag{18.7.14}$$

(cf. equation 18.4.7). Setting the formal derivative with respect to $\hat u_\mu$ to zero gives

$$\frac{\partial H}{\partial \hat u_\mu} = f'(\hat u_\mu) = \sum_j \lambda_j R_{j\mu} \tag{18.7.15}$$

or defining a function $G$ as the inverse function of $f'$,

$$\hat u_\mu = G\left(\sum_j \lambda_j R_{j\mu}\right) \tag{18.7.16}$$


This solution is only formal, since the $\lambda_j$’s must be found by requiring that equation
(18.7.16) satisfy all the constraints built into equation (18.7.14). However, equation
(18.7.16) does show the crucial fact that if $G$ is linear, then the solution $\hat{\mathbf{u}}$ contains only
a linear combination of basis functions $R_{j\mu}$ corresponding to actual measurements
$j$. This is equivalent to setting unmeasured $c_j$’s to zero. Notice that the principal
solution obtained from equation (18.4.11) in fact has a linear $G$.

In the problem of incomplete Fourier image reconstruction, the typical $R_{j\mu}$
has the form $\exp(-2\pi i\,\mathbf{k}_j\cdot\mathbf{x}_\mu)$, where $\mathbf{x}_\mu$ is a two-dimensional vector in the image
space and $\mathbf{k}_j$ is a two-dimensional wave-vector. If an image contains strong point
sources, then the effect of setting unmeasured $c_j$’s to zero is to produce sidelobe
ripples throughout the image plane. These ripples can mask any actual extended,
low-intensity image features lying between the point sources. If, however, the slope
of $G$ is smaller for small values of its argument, larger for large values, then ripples
in low-intensity portions of the image are relatively suppressed, while strong point
sources will be relatively sharpened (“superresolution”). This behavior on the slope
of $G$ is equivalent to requiring $f'''(u) < 0$. For $f(u) = u \ln u$, we in fact have
$f'''(u) = -1/u^2 < 0$.

In more picturesque language, the nonlinearity acts to “create” nonzero values
for the unmeasured $c_j$’s, so as to suppress the low-intensity ripple and sharpen the
point sources.

Is MEM Really Magical? 

How unique is the negentropy functional (18.7.9)? Recall that that equation is 
based on the assumption that luminance elements are a priori distributed over the 
pixels uniformly. If we instead had some other preferred a priori image in mind, one
with pixel intensities $m_\mu$, then it is easy to show that the negentropy becomes

$$H(\hat{\mathbf{u}}) = \sum_{\mu=1}^{M} \hat u_\mu \ln(\hat u_\mu/m_\mu) + \text{constant} \tag{18.7.17}$$

(the constant can then be ignored). All the rest of the discussion then goes through. 

More fundamentally, and despite statements by zealots to the contrary [7], there
is actually nothing universal about the functional form $f(u) = u \ln u$. In some
other physical situations (for example, the entropy of an electromagnetic field in the
limit of many photons per mode, as in radio-astronomy) the physical negentropy
functional is actually $f(u) = -\ln u$ (see [5] for other examples). In general, the
question, “Entropy of what?” is not uniquely answerable in any particular situation.
(See reference [9] for an attempt at articulating a more general principle that reduces
to one or another entropy functional under appropriate circumstances.)

The four numbered properties summarized above, plus the desirable sign for
nonlinearity, $f'''(u) < 0$, are all as true for $f(u) = -\ln u$ as for $f(u) = u \ln u$. In
fact these properties are shared by a nonlinear function as simple as $f(u) = -\sqrt{u}$,
which has no information theoretic justification at all (no logarithms!). MEM
reconstructions of test images using any of these entropy forms are virtually
indistinguishable [5].
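
The sign conditions claimed for the three candidate functions (positive second derivative, negative third derivative) can be spot-checked by finite differences. This is merely a sanity check of our own, with arbitrarily chosen sample points and step size, not a proof.

```python
import math


def second_derivative(f, u, h=1e-2):
    """Central-difference estimate of f''(u)."""
    return (f(u + h) - 2.0 * f(u) + f(u - h)) / h**2


def third_derivative(f, u, h=1e-2):
    """Central-difference estimate of f'''(u)."""
    return (f(u + 2*h) - 2.0*f(u + h) + 2.0*f(u - h) - f(u - 2*h)) / (2.0 * h**3)


candidates = {
    "u ln u": lambda u: u * math.log(u),
    "-ln u": lambda u: -math.log(u),
    "-sqrt(u)": lambda u: -math.sqrt(u),
}
sample_points = [0.5, 1.0, 2.0, 5.0]

# True for a candidate if f'' > 0 and f''' < 0 at every sample point.
signs_ok = {
    name: all(second_derivative(f, u) > 0.0 and third_derivative(f, u) < 0.0
              for u in sample_points)
    for name, f in candidates.items()
}
```

All three entries of `signs_ok` come out true, consistent with the analytic results $f''' = -1/u^2$, $-2/u^3$, and $-\tfrac{3}{8}u^{-5/2}$ respectively.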

By all available evidence, MEM seems to be neither more nor less than one
usefully nonlinear version of the general regularization scheme $\mathcal{A} + \lambda\mathcal{B}$ that we have
by now considered in many forms. Its peculiarities become strengths when applied
to the reconstruction from incomplete Fourier data of images that are expected 
to be dominated by very bright point sources, but which also contain interesting 
low-intensity, extended sources. For images of some other character, there is no 
reason to suppose that MEM methods will generally dominate other regularization 
schemes, either ones already known or yet to be invented. 
Algorithms for MEM

The goal is to find the vector $\hat{\mathbf{u}}$ that minimizes $\mathcal{A} + \lambda\mathcal{B}$ where in the notation
of equations (18.5.5), (18.5.6), and (18.7.13),

$$\mathcal{A} = |\mathbf{b} - \mathbf{A}\cdot\hat{\mathbf{u}}|^2 \qquad \mathcal{B} = \sum_\mu f(\hat u_\mu) \tag{18.7.18}$$

Compared with a “general” minimization problem, we have the advantage that
we can compute the gradients and the second partial derivative matrices (Hessian
matrices) explicitly,

$$\begin{aligned}
\nabla\mathcal{A} &= 2(\mathbf{A}^T\cdot\mathbf{A}\cdot\hat{\mathbf{u}} - \mathbf{A}^T\cdot\mathbf{b}) & \frac{\partial^2\mathcal{A}}{\partial\hat u_\mu\,\partial\hat u_\rho} &= [2\mathbf{A}^T\cdot\mathbf{A}]_{\mu\rho} \\
[\nabla\mathcal{B}]_\mu &= f'(\hat u_\mu) & \frac{\partial^2\mathcal{B}}{\partial\hat u_\mu\,\partial\hat u_\rho} &= \delta_{\mu\rho}\, f''(\hat u_\mu)
\end{aligned} \tag{18.7.19}$$

It is important to note that while $\mathcal{A}$’s second partial derivative matrix cannot be stored
(its size is the square of the number of pixels), it can be applied to any vector by
first applying $\mathbf{A}$, then $\mathbf{A}^T$. In the case of reconstruction from incomplete Fourier
data, or in the case of convolution with a translation invariant point spread function,
these applications will typically involve several FFTs. Likewise, the calculation of
the gradient $\nabla\mathcal{A}$ will involve FFTs in the application of $\mathbf{A}$ and $\mathbf{A}^T$.
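
The point about never storing the Hessian can be made concrete with a small dense example. In the sketch below, which is our own illustration, the matrix is tiny and is applied by explicit loops; in a real image problem `matvec` and `matvec_t` would be replaced by FFT-based operators. All names (`func_a`, `grad_a`, `hess_a_times`, and the sample data) are hypothetical.

```python
import math


def matvec(a, v):
    """Apply A to v."""
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def matvec_t(a, v):
    """Apply A-transpose to v without forming the transpose."""
    return [sum(a[i][j] * v[i] for i in range(len(a))) for j in range(len(a[0]))]


def func_a(a, b, u):
    """The functional |b - A.u|^2 of equation (18.7.18)."""
    return sum((p - q) ** 2 for p, q in zip(matvec(a, u), b))


def grad_a(a, b, u):
    """Gradient 2 (A^T.A.u - A^T.b), computed as 2 A^T.(A.u - b)."""
    resid = [p - q for p, q in zip(matvec(a, u), b)]
    return [2.0 * g for g in matvec_t(a, resid)]


def hess_a_times(a, v):
    """(2 A^T.A).v: first apply A, then A^T; A^T.A itself is never formed."""
    return [2.0 * g for g in matvec_t(a, matvec(a, v))]


def grad_b(u):
    """[grad B]_mu = f'(u_mu) for f(u) = u ln u."""
    return [math.log(x) + 1.0 for x in u]


def hess_b_diag(u):
    """Diagonal of B's Hessian, f''(u_mu) = 1/u_mu for f(u) = u ln u."""
    return [1.0 / x for x in u]


a_mat = [[0.7, 0.2, 0.1], [0.1, 0.6, 0.3]]   # 2 measurements, 3 pixels
b_vec = [1.5, 2.0]
u0 = [1.0, 2.0, 0.5]
g0 = grad_a(a_mat, b_vec, u0)
hv = hess_a_times(a_mat, [1.0, 0.0, 0.0])    # first column of 2 A^T.A
```

Because the first functional is quadratic, the gradient is linear in the image and the Hessian-vector product is simply a difference of gradients; that identity makes a convenient self-test.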

While some success has been achieved with the classical conjugate gradient
method (§10.6), it is often found that the nonlinearity in $f(u) = u \ln u$ causes
problems. Attempted steps that give $\hat{\mathbf{u}}$ with even one negative value must be cut in
magnitude, sometimes so severely as to slow the solution to a crawl. The underlying
problem is that the conjugate gradient method develops its information about the 
inverse of the Hessian matrix a bit at a time, while changing its location in the search 
space. When a nonlinear function is quite different from a pure quadratic form, the 
old information becomes obsolete before it gets usefully exploited. 

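Skilling and collaborators [6,7,10,11] developed a complicated but highly successful
scheme, wherein a minimum is repeatedly sought not along a single search
direction, but in a small- (typically three-) dimensional subspace, spanned by vectors
that are calculated anew at each landing point. The subspace basis vectors are
chosen in such a way as to avoid directions leading to negative values. One of the
most successful choices is the three-dimensional subspace spanned by the vectors
with components given by

$$\begin{aligned}
e^{(1)}_\mu &= \hat u_\mu\,[\nabla\mathcal{A}]_\mu \\
e^{(2)}_\mu &= \hat u_\mu\,[\nabla\mathcal{B}]_\mu \\
e^{(3)}_\mu &= \frac{\hat u_\mu \sum_\rho (\partial^2\mathcal{A}/\partial\hat u_\mu\,\partial\hat u_\rho)\,\hat u_\rho\,[\nabla\mathcal{B}]_\rho}{\sqrt{\sum_\rho \hat u_\rho\,([\nabla\mathcal{B}]_\rho)^2}} - \frac{\hat u_\mu \sum_\rho (\partial^2\mathcal{A}/\partial\hat u_\mu\,\partial\hat u_\rho)\,\hat u_\rho\,[\nabla\mathcal{A}]_\rho}{\sqrt{\sum_\rho \hat u_\rho\,([\nabla\mathcal{A}]_\rho)^2}}
\end{aligned} \tag{18.7.20}$$

(In these equations there is no sum over $\mu$.) The form of the $\mathbf{e}^{(3)}$ has some justification
if one views dot products as occurring in a space with the metric $g_{\mu\nu} = \delta_{\mu\nu}/\hat u_\mu$,
chosen to make zero values “far away”; see [6].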
Within the three-dimensional subspace, the three-component gradient and nine-component
Hessian matrix are computed by projection from the large space, and
the minimum in the subspace is estimated by (trivially) solving three simultaneous
linear equations, as in §10.7, equation (10.7.4). The size of a step $\Delta\hat{\mathbf{u}}$ is required
to be limited by the inequality

$$\sum_\mu (\Delta\hat u_\mu)^2/\hat u_\mu < (0.1 \text{ to } 0.5)\,U \tag{18.7.21}$$

Because the gradient directions $\nabla\mathcal{A}$ and $\nabla\mathcal{B}$ are separately available, it is possible
to combine the minimum search with a simultaneous adjustment of $\lambda$ so as finally to
satisfy the desired constraint. There are various further tricks employed.

A less general, but in practice often equally satisfactory, approach is due to
Cornwell and Evans [12]. Here, noting that $\mathcal{B}$’s Hessian (second partial derivative)
matrix is diagonal, one asks whether there is a useful diagonal approximation to
$\mathcal{A}$’s Hessian, namely $2\mathbf{A}^T\cdot\mathbf{A}$. If $\Lambda_\mu$ denotes the diagonal components of such an
approximation, then a useful step in $\hat{\mathbf{u}}$ would be

$$\Delta\hat u_\mu = -\frac{1}{\Lambda_\mu + \lambda f''(\hat u_\mu)}\,(\nabla\mathcal{A} + \lambda\nabla\mathcal{B}) \tag{18.7.22}$$

(again compare equation 10.7.4). Even more extreme, one might seek an approximation
with constant diagonal elements, $\Lambda_\mu = \Lambda$, so that

$$\Delta\hat u_\mu = -\frac{1}{\Lambda + \lambda f''(\hat u_\mu)}\,(\nabla\mathcal{A} + \lambda\nabla\mathcal{B}) \tag{18.7.23}$$

Since $\mathbf{A}^T\cdot\mathbf{A}$ has something of the nature of a doubly convolved point spread
function, and since in real cases one often has a point spread function with a sharp
central peak, even the more extreme of these approximations is often fruitful. One
starts with a rough estimate of $\Lambda$ obtained from the $A_{i\mu}$’s, e.g.,

$$\Lambda \sim \left\langle \sum_i [A_{i\mu}]^2 \right\rangle \tag{18.7.24}$$

An accurate value is not important, since in practice $\Lambda$ is adjusted adaptively: If $\Lambda$
is too large, then equation (18.7.23)’s steps will be too small (that is, larger steps in
the same direction will produce even greater decrease in $\mathcal{A} + \lambda\mathcal{B}$). If $\Lambda$ is too small,
then attempted steps will land in an unfeasible region (negative values of $\hat u_\mu$), or will
result in an increased $\mathcal{A} + \lambda\mathcal{B}$. There is an obvious similarity between the adjustment
of $\Lambda$ here and the Levenberg-Marquardt method of §15.5; this should not be too
surprising, since MEM is closely akin to the problem of nonlinear least-squares
fitting. Reference [12] also discusses how the value of $\Lambda + \lambda f''(\hat u_\mu)$ can be used to
adjust the Lagrange multiplier $\lambda$ so as to converge to the desired value of $\chi^2$.
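
To make the adaptive adjustment concrete, here is a toy iteration in the spirit of equation (18.7.23). It is a sketch under our own assumptions, not the algorithm of reference [12]: the doubling and 0.7 shrink factors, the flat starting image, the fixed value of the multiplier `lam`, and the 4-pixel blur matrix are all arbitrary illustrative choices, and no attempt is made to steer the fit toward a target chi-square.

```python
import math


def matvec(a, v):
    return [sum(x * y for x, y in zip(row, v)) for row in a]


def matvec_t(a, v):
    return [sum(a[i][j] * v[i] for i in range(len(a))) for j in range(len(a[0]))]


def objective(a, b, u, lam):
    """|b - A.u|^2 + lam * sum(u ln u); requires every u_mu > 0."""
    resid = [p - q for p, q in zip(matvec(a, u), b)]
    return sum(r * r for r in resid) + lam * sum(x * math.log(x) for x in u)


def diagonal_mem(a, b, lam, n_iter=50):
    m = len(a[0])
    u = [sum(b) / m] * m              # flat, positive start (assumes sum(b) > 0)
    # Rough constant-diagonal estimate, as in (18.7.24).
    big_lambda = sum(x * x for row in a for x in row) / m
    history = [objective(a, b, u, lam)]
    for _ in range(n_iter):
        resid = [p - q for p, q in zip(matvec(a, u), b)]
        grad = [2.0 * g + lam * (math.log(x) + 1.0)
                for g, x in zip(matvec_t(a, resid), u)]
        for _ in range(60):
            # Step of the form (18.7.23) with f''(u) = 1/u.
            trial = [x - g / (big_lambda + lam / x) for x, g in zip(u, grad)]
            if min(trial) > 0.0 and objective(a, b, trial, lam) < history[-1]:
                u = trial
                history.append(objective(a, b, u, lam))
                big_lambda *= 0.7     # step worked: be bolder next time
                break
            big_lambda *= 2.0         # infeasible or uphill: shorten the step
        else:
            break                     # no acceptable step found; stop
    return u, history


blur = [[0.6, 0.2, 0.0, 0.0],
        [0.2, 0.6, 0.2, 0.0],
        [0.0, 0.2, 0.6, 0.2],
        [0.0, 0.0, 0.2, 0.6]]
truth = [1.0, 5.0, 1.0, 1.0]          # one bright pixel on a flat background
data = matvec(blur, truth)
u_est, history = diagonal_mem(blur, data, lam=0.01)
```

By construction every accepted step keeps all pixels positive and lowers the objective, so `history` is strictly decreasing; how quickly it approaches the minimum depends on how well a constant diagonal mimics the true Hessian.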

All practical MEM algorithms are found to require on the order of 30 to 50 
iterations to converge. This convergence behavior is not now understood in any 
fundamental way. 

“Bayesian” versus “Historic” Maximum Entropy

Several more recent developments in maximum entropy image restoration
go under the rubric “Bayesian” to distinguish them from the previous “historic”
methods. See [13] for details and references.

• Better priors: We already noted that the entropy functional (equation
18.7.13) is invariant under scrambling all pixels and has no notion of
smoothness. The so-called “intrinsic correlation function” (ICF) model
(Ref. [13], where it is called “New MaxEnt”) is similar enough to the
entropy functional to allow similar algorithms, but it makes the values of
neighboring pixels correlated, enforcing smoothness.

• Better estimation of $\lambda$: Above we chose $\lambda$ to bring $\chi^2$ into its expected
narrow statistical range of $N \pm (2N)^{1/2}$. This in effect overestimates $\chi^2$,
however, since some effective number $\gamma$ of parameters are being “fitted” in
doing the reconstruction. A Bayesian approach leads to a self-consistent
estimate of this $\gamma$ and an objectively better choice for $\lambda$.

Chapter 19. Partial Differential Equations

19.0 Introduction 


The numerical treatment of partial differential equations is, by itself, a vast 
subject. Partial differential equations are at the heart of many, if not most, 
computer analyses or simulations of continuous physical systems, such as fluids, 
electromagnetic fields, the human body, and so on. The intent of this chapter is to 
give the briefest possible useful introduction. Ideally, there would be an entire second 
volume of Numerical Recipes dealing with partial differential equations alone. (The 
references [1-4] provide, of course, available alternatives.)

In most mathematics books, partial differential equations (PDEs) are classified 
into the three categories, hyperbolic, parabolic, and elliptic, on the basis of their 
characteristics, or curves of information propagation. The prototypical example of 
a hyperbolic equation is the one-dimensional wave equation 


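$$\frac{\partial^2 u}{\partial t^2} = v^2\,\frac{\partial^2 u}{\partial x^2} \tag{19.0.1}$$

where $v$ = constant is the velocity of wave propagation. The prototypical parabolic
equation is the diffusion equation

$$\frac{\partial u}{\partial t} = \frac{\partial}{\partial x}\left(D\,\frac{\partial u}{\partial x}\right) \tag{19.0.2}$$

where $D$ is the diffusion coefficient. The prototypical elliptic equation is the
Poisson equation

$$\frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = \rho(x,y) \tag{19.0.3}$$

where the source term $\rho$ is given. If the source term is equal to zero, the equation
is Laplace’s equation.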

From a computational point of view, the classification into these three canonical
types is not very meaningful, or at least not as important as some other essential
distinctions. Equations (19.0.1) and (19.0.2) both define initial value or Cauchy
problems: If information on $u$ (perhaps including time derivative information) is
given at some initial time $t_0$ for all $x$, then the equations describe how $u(x,t)$
propagates itself forward in time. In other words, equations (19.0.1) and (19.0.2)
describe time evolution. The goal of a numerical code should be to track that time
evolution with some desired accuracy.

By contrast, equation (19.0.3) directs us to find a single “static” function $u(x,y)$
which satisfies the equation within some $(x,y)$ region of interest, and which (one
must also specify) has some desired behavior on the boundary of the region of
interest. These problems are called boundary value problems. In general it is not
possible stably to just “integrate in from the boundary” in the same sense that an
initial value problem can be “integrated forward in time.” Therefore, the goal of a
numerical code is somehow to converge on the correct solution everywhere at once.

Figure 19.0.1. Initial value problem (a) and boundary value problem (b) are contrasted. In (a) initial
values are given on one “time slice,” and it is desired to advance the solution in time, computing successive
rows of open dots in the direction shown by the arrows. Boundary conditions at the left and right edges
of each row (⊗) must also be supplied, but only one row at a time. Only one, or a few, previous rows
need be maintained in memory. In (b), boundary values are specified around the edge of a grid, and an
iterative process is employed to find the values of all the internal points (open circles). All grid points
must be maintained in memory.

This, then, is the most important classification from a computational point 
of view: Is the problem at hand an initial value (time evolution) problem? or 
is it a boundary value (static solution) problem? Figure 19.0.1 emphasizes the 
distinction. Notice that while the italicized terminology is standard, the terminology 
in parentheses is a much better description of the dichotomy from a computational 
perspective. The subclassification of initial value problems into parabolic and 
hyperbolic is much less important because (i) many actual problems are of a mixed 
type, and (ii) as we will see, most hyperbolic problems get parabolic pieces mixed 
into them by the time one is discussing practical computational schemes. 

Initial Value Problems 

An initial value problem is defined by answers to the following questions: 

• What are the dependent variables to be propagated forward in time? 

• What is the evolution equation for each variable? Usually the evolution 
equations will all be coupled, with more than one dependent variable 
appearing on the right-hand side of each equation. 

• What is the highest time derivative that occurs in each variable’s evolution 
equation? If possible, this time derivative should be put alone on the 
equation’s left-hand side. Not only the value of a variable, but also the 
value of all its time derivatives — up to the highest one — must be 
specified to define the evolution. 

• What special equations (boundary conditions) govern the evolution in time 
of points on the boundary of the spatial region of interest? Examples: 
Dirichlet conditions specify the values of the boundary points as a function 
of time; Neumann conditions specify the values of the normal gradients on 
the boundary; outgoing-wave boundary conditions are just what they say. 

Sections 19.1-19.3 of this chapter deal with initial value problems of several 
different forms. We make no pretence of completeness, but rather hope to convey a 
certain amount of generalizable information through a few carefully chosen model 
examples. These examples will illustrate an important point: One’s principal 
computational concern must be the stability of the algorithm. Many reasonable- 
looking algorithms for initial value problems just don’t work — they are numerically 
unstable. 
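
A small experiment shows what "numerically unstable" looks like in practice. The sketch below is our own illustration and is not one of the model examples treated later in the chapter: it advances the diffusion equation (19.0.2), with constant D, by the simplest explicit differencing, using the dimensionless step `r` = D dt/dx^2. For this particular scheme it is a standard result that r must not exceed 1/2.

```python
def explicit_diffusion_peak(r, n_steps, n_points=51):
    """Advance u_t = D u_xx explicitly and return max |u| at the end.

    Each step applies u_j <- u_j + r (u_{j+1} - 2 u_j + u_{j-1}), with
    r = D*dt/dx**2, starting from a unit spike; the two end points are
    held at zero (Dirichlet conditions).
    """
    u = [0.0] * n_points
    u[n_points // 2] = 1.0
    for _ in range(n_steps):
        new = u[:]
        for j in range(1, n_points - 1):
            new[j] = u[j] + r * (u[j + 1] - 2.0 * u[j] + u[j - 1])
        u = new
    return max(abs(x) for x in u)


stable_peak = explicit_diffusion_peak(r=0.4, n_steps=200)
unstable_peak = explicit_diffusion_peak(r=0.6, n_steps=200)
```

With r = 0.4 the spike spreads and decays as diffusion should; with r = 0.6, a seemingly harmless 50 percent increase in the timestep, the shortest-wavelength component is amplified at every step and the "solution" grows without bound.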



Boundary Value Problems 

The questions that define a boundary value problem are: 

• What are the variables? 

• What equations are satisfied in the interior of the region of interest? 

• What equations are satisfied by points on the boundary of the region of 
interest? (Here Dirichlet and Neumann conditions are possible choices for 
elliptic second-order equations, but more complicated boundary conditions 
can also be encountered.) 

In contrast to initial value problems, stability is relatively easy to achieve
for boundary value problems. Thus, the efficiency of the algorithms, both in 
computational load and storage requirements, becomes the principal concern. 

Because all the conditions on a boundary value problem must be satisfied 
“simultaneously,” these problems usually boil down, at least conceptually, to the 
solution of large numbers of simultaneous algebraic equations. When such equations 
are nonlinear, they are usually solved by linearization and iteration; so without much 
loss of generality we can view the problem as being the solution of special, large 
linear sets of equations. 

As an example, one which we will refer to in §§19.4-19.6 as our “model
problem,” let us consider the solution of equation (19.0.3) by the finite-difference
method. We represent the function $u(x,y)$ by its values at the discrete set of points

$$\begin{aligned}
x_j &= x_0 + j\Delta, \qquad j = 0,1,\ldots,J \\
y_l &= y_0 + l\Delta, \qquad l = 0,1,\ldots,L
\end{aligned} \tag{19.0.4}$$

where $\Delta$ is the grid spacing. From now on, we will write $u_{j,l}$ for $u(x_j,y_l)$, and
$\rho_{j,l}$ for $\rho(x_j,y_l)$. For (19.0.3) we substitute a finite-difference representation (see
Figure 19.0.2),

$$\frac{u_{j+1,l} - 2u_{j,l} + u_{j-1,l}}{\Delta^2} + \frac{u_{j,l+1} - 2u_{j,l} + u_{j,l-1}}{\Delta^2} = \rho_{j,l} \tag{19.0.5}$$

or equivalently

$$u_{j+1,l} + u_{j-1,l} + u_{j,l+1} + u_{j,l-1} - 4u_{j,l} = \Delta^2 \rho_{j,l} \tag{19.0.6}$$

To write this system of linear equations in matrix form we need to make a
vector out of $u$. Let us number the two dimensions of grid points in a single
one-dimensional sequence by defining

$$i = j(L+1) + l \quad \text{for} \quad j = 0,1,\ldots,J, \quad l = 0,1,\ldots,L \tag{19.0.7}$$

In other words, $i$ increases most rapidly along the columns representing $y$ values.
Equation (19.0.6) now becomes

$$u_{i+L+1} + u_{i-(L+1)} + u_{i+1} + u_{i-1} - 4u_i = \Delta^2 \rho_i \tag{19.0.8}$$

This equation holds only at the interior points $j = 1,2,\ldots,J-1$; $l = 1,2,\ldots,L-1$.
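
The bookkeeping of equations (19.0.7) and (19.0.8) is easy to get wrong by one, so a quick check is worthwhile. The sketch below is illustrative only; the helper names and the small test grid are our own, with `n_j` and `n_l` standing for J and L. It flattens a two-index grid and confirms that the five-point stencil gives the same value in both numberings.

```python
def flat_index(j, l, n_l):
    """Equation (19.0.7): i = j (L+1) + l, with n_l standing for L."""
    return j * (n_l + 1) + l


def is_boundary(i, n_j, n_l):
    """True if flat index i lies on an edge j = 0, j = J, l = 0, or l = L."""
    j, l = divmod(i, n_l + 1)
    return j in (0, n_j) or l in (0, n_l)


def stencil_2d(u, j, l):
    """Left side of (19.0.6) on the two-index grid."""
    return u[j + 1][l] + u[j - 1][l] + u[j][l + 1] + u[j][l - 1] - 4.0 * u[j][l]


def stencil_flat(uf, i, n_l):
    """Left side of (19.0.8) on the flattened vector."""
    stride = n_l + 1
    return uf[i + stride] + uf[i - stride] + uf[i + 1] + uf[i - 1] - 4.0 * uf[i]


n_j, n_l = 4, 3                       # a 5 x 4 grid of points
grid = [[j * j - 2.0 * l * l + 0.5 * j * l for l in range(n_l + 1)]
        for j in range(n_j + 1)]      # arbitrary test values

flat = [0.0] * ((n_j + 1) * (n_l + 1))
for j in range(n_j + 1):
    for l in range(n_l + 1):
        flat[flat_index(j, l, n_l)] = grid[j][l]

mismatch = max(
    abs(stencil_2d(grid, j, l) - stencil_flat(flat, flat_index(j, l, n_l), n_l))
    for j in range(1, n_j) for l in range(1, n_l)
)
n_boundary = sum(is_boundary(i, n_j, n_l) for i in range(len(flat)))
```

Neighbors in l sit one place apart in the flattened vector, while neighbors in j sit L+1 places apart; the latter are the origin of the "fringes" in the matrix.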

Figure 19.0.2. Finite-difference representation of a second-order elliptic equation on a two-dimensional
grid. The second derivatives at the point A are evaluated using the points to which A is shown connected.
The second derivatives at point B are evaluated using the connected points and also using “right-hand
side” boundary information, shown schematically in the figure.

The points where

$$
\begin{array}{ll}
j = 0 & [\text{i.e., } i = 0, \ldots, L] \\
j = J & [\text{i.e., } i = J(L+1), \ldots, J(L+1)+L] \\
l = 0 & [\text{i.e., } i = 0,\; L+1, \ldots, J(L+1)] \\
l = L & [\text{i.e., } i = L,\; L+1+L, \ldots, J(L+1)+L]
\end{array}
\tag{19.0.9}
$$

are boundary points where either $u$ or its derivative has been specified. If we pull
all this “known” information over to the right-hand side of equation (19.0.8), then
the equation takes the form

$$
\mathbf{A} \cdot \mathbf{u} = \mathbf{b}
\tag{19.0.10}
$$

where A has the form shown in Figure 19.0.3. The matrix A is called “tridiagonal 
with fringes.” A general linear second-order elliptic equation 


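$$
a(x,y)\frac{\partial^2 u}{\partial x^2} + b(x,y)\frac{\partial u}{\partial x}
+ c(x,y)\frac{\partial^2 u}{\partial y^2} + d(x,y)\frac{\partial u}{\partial y}
+ e(x,y)\frac{\partial^2 u}{\partial x\,\partial y} + f(x,y)\,u = g(x,y)
\tag{19.0.11}
$$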


will lead to a matrix of similar structure except that the nonzero entries will not 
be constants. 

As a rough classification, there are three different approaches to the solution 
of equation (19.0.10), not all applicable in all cases: relaxation methods, “rapid” 
methods (e.g., Fourier methods), and direct matrix methods. 

Figure 19.0.3. Matrix structure derived from a second-order elliptic equation (here equation 19.0.6). All
elements not shown are zero. The matrix has diagonal blocks that are themselves tridiagonal, and sub- 
and super-diagonal blocks that are diagonal. This form is called “tridiagonal with fringes.” A matrix this 
sparse would never be stored in its full form as shown here. 

Relaxation methods make immediate use of the structure of the sparse matrix 
A. The matrix is split into two parts 



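$$
\mathbf{A} = \mathbf{E} - \mathbf{F}
\tag{19.0.12}
$$

where $\mathbf{E}$ is easily invertible and $\mathbf{F}$ is the remainder. Then (19.0.10) becomes

$$
\mathbf{E} \cdot \mathbf{u} = \mathbf{F} \cdot \mathbf{u} + \mathbf{b}
\tag{19.0.13}
$$

The relaxation method involves choosing an initial guess $\mathbf{u}^{(0)}$ and then solving
successively for iterates $\mathbf{u}^{(r)}$ from

$$
\mathbf{E} \cdot \mathbf{u}^{(r)} = \mathbf{F} \cdot \mathbf{u}^{(r-1)} + \mathbf{b}
\tag{19.0.14}
$$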


Since E is chosen to be easily invertible, each iteration is fast. We will discuss 
relaxation methods in some detail in §19.5 and §19.6. 
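
To make the splitting concrete, here is a hedged sketch of one sweep of the simplest choice, in which E is taken to be the diagonal part of A (Jacobi iteration) for the model problem (19.0.6). The function name `jacobi_sweep`, the treatment of all edge points as fixed Dirichlet values, and the argument layout are assumptions of this sketch rather than anything prescribed above.

```c
#include <stddef.h>

/* One Jacobi sweep for the model problem (19.0.6) on a (J+1) x (L+1) grid.
   Arrays use the single index i = j*(L+1) + l of (19.0.7).
   Edge points (j = 0, J or l = 0, L) are copied unchanged, i.e. treated as
   fixed Dirichlet data (an assumption of this sketch). h is the spacing. */
void jacobi_sweep(const double *u_old, double *u_new, const double *rho,
                  size_t J, size_t L, double h)
{
    size_t stride = L + 1;
    for (size_t j = 0; j <= J; j++) {
        for (size_t l = 0; l <= L; l++) {
            size_t i = j * stride + l;
            if (j == 0 || j == J || l == 0 || l == L) {
                u_new[i] = u_old[i];          /* boundary point: keep fixed */
            } else {
                /* Solve the diagonal equation -4 u_i = -(neighbors) + h^2 rho_i */
                u_new[i] = 0.25 * (u_old[i + stride] + u_old[i - stride]
                                   + u_old[i + 1] + u_old[i - 1]
                                   - h * h * rho[i]);
            }
        }
    }
}
```

Because E is diagonal here, "inverting" it costs one multiplication per point, which is why each iteration is cheap; the price is slow convergence, the subject of §19.5.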

So-called rapid methods [5] apply for only a rather special class of equations:
those with constant coefficients, or, more generally, those that are separable in the 
chosen coordinates. In addition, the boundaries must coincide with coordinate lines. 
This special class of equations is met quite often in practice. We defer detailed 
discussion to §19.4. Note, however, that the multigrid relaxation methods discussed 
in §19.6 can be faster than “rapid” methods. 

Matrix methods attempt to solve the equation 

$$
\mathbf{A} \cdot \mathbf{x} = \mathbf{b}
\tag{19.0.15}
$$

directly. The degree to which this is practical depends very strongly on the exact 
structure of the matrix A for the problem at hand, so our discussion can go no farther 
than a few remarks and references at this point. 

Sparseness of the matrix must be the guiding force. Otherwise the matrix 
problem is prohibitively large. For example, the simplest problem on a $100 \times 100$
spatial grid would involve 10000 unknown $u_{j,l}$’s, implying a $10000 \times 10000$ matrix
$\mathbf{A}$, containing $10^8$ elements!

As we discussed at the end of §2.7, if A is symmetric and positive definite 
(as it usually is in elliptic problems), the conjugate-gradient algorithm can be 
used. In practice, rounding error often spoils the effectiveness of the conjugate 
gradient algorithm for solving finite-difference equations. However, it is useful when 
incorporated in methods that first rewrite the equations so that A is transformed to a 
matrix A' that is close to the identity matrix. The quadratic surface defined by the 
equations then has almost spherical contours, and the conjugate gradient algorithm 
works very well. In §2.7, in the routine linbcg, an analogous preconditioner 
was exploited for non-positive definite problems with the more general biconjugate 
gradient method. For the positive definite case that arises in PDEs, an example of 
a successful implementation is the incomplete Cholesky conjugate gradient method 
(ICCG) (see [6-8]). 

Another method that relies on a transformation approach is the strongly implicit
procedure of Stone [9]. A program called SIPSOL that implements this routine has
been published [10].

A third class of matrix methods is the Analyze-Factorize-Operate approach as 
described in §2.7. 

Generally speaking, when you have the storage available to implement these
methods (not nearly as much as the $10^8$ above, but usually much more than is
required by relaxation methods), then you should consider doing so. Only multigrid
relaxation methods (§19.6) are competitive with the best matrix methods. For grids
larger than, say, $300 \times 300$, however, it is generally found that only relaxation
methods, or “rapid” methods when they are applicable, are possible.

There Is More to Life than Finite Differencing 

Besides finite differencing, there are other methods for solving PDEs. Most
important are finite element, Monte Carlo, spectral, and variational methods.
Unfortunately, we shall barely be able to do justice to finite differencing in this chapter,
and so shall not be able to discuss these other methods in this book. Finite element
methods [11-12] are often preferred by practitioners in solid mechanics and structural
engineering; these methods allow considerable freedom in putting computational
elements where you want them, important when dealing with highly irregular
geometries. Spectral methods [13-15] are preferred for very regular geometries and smooth
functions; they converge more rapidly than finite-difference methods (cf. §19.4), but
they do not work well for problems with discontinuities.

19.1 Flux-Conservative Initial Value Problems


A large class of initial value (time-evolution) PDEs in one space dimension can
be cast into the form of a flux-conservative equation,

$$
\frac{\partial \mathbf{u}}{\partial t} = -\frac{\partial \mathbf{F}(\mathbf{u})}{\partial x}
\tag{19.1.1}
$$


where u and F are vectors, and where (in some cases) F may depend not only on u 
but also on spatial derivatives of u. The vector F is called the conserved flux. 

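For example, the prototypical hyperbolic equation, the one-dimensional wave
equation with constant velocity of propagation $v$

$$
\frac{\partial^2 u}{\partial t^2} = v^2 \frac{\partial^2 u}{\partial x^2}
\tag{19.1.2}
$$

can be rewritten as a set of two first-order equations

$$
\begin{aligned}
\frac{\partial r}{\partial t} &= v\frac{\partial s}{\partial x} \\
\frac{\partial s}{\partial t} &= v\frac{\partial r}{\partial x}
\end{aligned}
\tag{19.1.3}
$$

where

$$
r \equiv v\frac{\partial u}{\partial x}, \qquad s \equiv \frac{\partial u}{\partial t}
\tag{19.1.4}
$$

In this case $r$ and $s$ become the two components of $\mathbf{u}$, and the flux is given by
the linear matrix relation

$$
\mathbf{F}(\mathbf{u}) = \begin{pmatrix} 0 & -v \\ -v & 0 \end{pmatrix} \cdot \mathbf{u}
\tag{19.1.5}
$$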


(The physicist-reader may recognize equations (19.1.3) as analogous to Maxwell’s 
equations for one-dimensional propagation of electromagnetic waves.) 

We will consider, in this section, a prototypical example of the general flux- 
conservative equation (19.1.1), namely the equation for a scalar u, 


$$
\frac{\partial u}{\partial t} = -v\frac{\partial u}{\partial x}
\tag{19.1.6}
$$

with $v$ a constant. As it happens, we already know analytically that the general
solution of this equation is a wave propagating in the positive $x$-direction,

$$
u = f(x - vt)
\tag{19.1.7}
$$

where $f$ is an arbitrary function. However, the numerical strategies that we develop
will be equally applicable to the more general equations represented by (19.1.1). In 
some contexts, equation (19.1.6) is called an advective equation, because the quantity 
u is transported by a “fluid flow” with a velocity v. 

How do we go about finite differencing equation (19.1.6) (or, analogously, 
19.1.1)? The straightforward approach is to choose equally spaced points along both
the $t$- and $x$-axes. Thus denote

$$
\begin{aligned}
x_j &= x_0 + j\Delta x, \qquad j = 0, 1, \ldots, J \\
t_n &= t_0 + n\Delta t, \qquad n = 0, 1, \ldots, N
\end{aligned}
\tag{19.1.8}
$$

Let $u_j^n$ denote $u(t_n, x_j)$. We have several choices for representing the time
derivative term. The obvious way is to set

$$
\left.\frac{\partial u}{\partial t}\right|_{j,n} = \frac{u_j^{n+1} - u_j^n}{\Delta t} + O(\Delta t)
\tag{19.1.9}
$$

This is called forward Euler differencing (cf. equation 16.1.1). While forward Euler
is only first-order accurate in $\Delta t$, it has the advantage that one is able to calculate
quantities at timestep $n+1$ in terms of only quantities known at timestep $n$. For the
space derivative, we can use a second-order representation still using only quantities
known at timestep $n$:

$$
\left.\frac{\partial u}{\partial x}\right|_{j,n} = \frac{u_{j+1}^n - u_{j-1}^n}{2\Delta x} + O(\Delta x^2)
\tag{19.1.10}
$$

The resulting finite-difference approximation to equation (19.1.6) is called the FTCS
representation (Forward Time Centered Space),

$$
\frac{u_j^{n+1} - u_j^n}{\Delta t} = -v\left(\frac{u_{j+1}^n - u_{j-1}^n}{2\Delta x}\right)
\tag{19.1.11}
$$

which can easily be rearranged to be a formula for $u_j^{n+1}$ in terms of the other
quantities. The FTCS scheme is illustrated in Figure 19.1.1. It’s a fine example of
an algorithm that is easy to derive, takes little storage, and executes quickly. Too
bad it doesn’t work! (See below.)

Figure 19.1.1. Representation of the Forward Time Centered Space (FTCS) differencing scheme. In this
and subsequent figures, the open circle is the new point at which the solution is desired; filled circles are
known points whose function values are used in calculating the new point; the solid lines connect points
that are used to calculate spatial derivatives; the dashed lines connect points that are used to calculate time
derivatives. The FTCS scheme is generally unstable for hyperbolic problems and cannot usually be used.

The FTCS representation is an explicit scheme. This means that $u_j^{n+1}$ for each
$j$ can be calculated explicitly from the quantities that are already known. Later we
shall meet implicit schemes, which require us to solve implicit equations coupling
the $u_j^{n+1}$ for various $j$. (Explicit and implicit methods for ordinary differential
equations were discussed in §16.6.) The FTCS algorithm is also an example of
a single-level scheme, since only values at time level $n$ have to be stored to find
values at time level $n+1$.
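
To show how little an explicit, single-level update involves, here is a hedged C sketch of one FTCS step for the advective equation (19.1.6). The function name `ftcs_step`, the periodic wraparound at the ends, and the use of two separate arrays are assumptions made for this illustration only.

```c
#include <stddef.h>

/* One FTCS step for du/dt = -v du/dx on a periodic grid of n points.
   alpha = v*dt/dx. u_old holds time level n; u_new receives level n+1.
   The two arrays must not overlap. */
void ftcs_step(const double *u_old, double *u_new, size_t n, double alpha)
{
    for (size_t j = 0; j < n; j++) {
        size_t jp = (j + 1) % n;          /* j+1 with periodic wraparound */
        size_t jm = (j + n - 1) % n;      /* j-1 with periodic wraparound */
        u_new[j] = u_old[j] - 0.5 * alpha * (u_old[jp] - u_old[jm]);
    }
}
```

Applied to the four-point mode {0, 1, 0, -1} with alpha = 0.5, a single step raises the sum of squares from 2 to 2.5, a first hint of the growth that makes the scheme unusable.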

von Neumann Stability Analysis 

Unfortunately, equation (19.1.11) is of very limited usefulness. It is an unstable 
method, which can be used only (if at all) to study waves for a short fraction of one 
oscillation period. To find alternative methods with more general applicability, we 
must introduce the von Neumann stability analysis. 

The von Neumann analysis is local: We imagine that the coefficients of the 
difference equations are so slowly varying as to be considered constant in space 
and time. In that case, the independent solutions, or eigenmodes, of the difference 
equations are all of the form 



$$
u_j^n = \xi^n e^{ikj\Delta x}
\tag{19.1.12}
$$

where $k$ is a real spatial wave number (which can have any value) and $\xi = \xi(k)$ is
a complex number that depends on $k$. The key fact is that the time dependence of
a single eigenmode is nothing more than successive integer powers of the complex
number $\xi$. Therefore, the difference equations are unstable (have exponentially
growing modes) if $|\xi(k)| > 1$ for some $k$. The number $\xi$ is called the amplification
factor at a given wave number $k$.

Figure 19.1.2. Representation of the Lax differencing scheme, as in the previous figure. The stability
criterion for this scheme is the Courant condition.

To find $\xi(k)$, we simply substitute (19.1.12) back into (19.1.11). Dividing
by $\xi^n$, we get

$$
\xi(k) = 1 - i\frac{v\Delta t}{\Delta x}\sin k\Delta x
\tag{19.1.13}
$$

whose modulus is $> 1$ for all $k$; so the FTCS scheme is unconditionally unstable.
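
For the record, the modulus follows in one line, since the amplification factor of FTCS has real part 1 and imaginary part $-(v\Delta t/\Delta x)\sin k\Delta x$:

$$
|\xi(k)|^2 = 1 + \left(\frac{v\Delta t}{\Delta x}\right)^2 \sin^2 k\Delta x \;\ge\; 1
$$

Equality holds only for those $k$ at which $\sin k\Delta x = 0$. Every other mode grows at each step, whatever the value of $\Delta t$, which is what "unconditionally" means here.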

If the velocity $v$ were a function of $t$ and $x$, then we would write $v_j^n$ in equation
(19.1.11). In the von Neumann stability analysis we would still treat $v$ as a constant,
the idea being that for $v$ slowly varying the analysis is local. In fact, even in the
case of strictly constant $v$, the von Neumann analysis does not rigorously treat the
end effects at $j = 0$ and $j = J$.

More generally, if the equation’s right-hand side were nonlinear in $u$, then a
von Neumann analysis would linearize by writing $u = u_0 + \delta u$, expanding to linear
order in $\delta u$. Assuming that the $u_0$ quantities already satisfy the difference equation
exactly, the analysis would look for an unstable eigenmode of $\delta u$.

Despite its lack of rigor, the von Neumann method generally gives valid answers
and is much easier to apply than more careful methods. We accordingly adopt it
exclusively. (See, for example, [1] for a discussion of other methods of stability
analysis.)

Lax Method 



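The instability in the FTCS method can be cured by a simple change due to Lax.
One replaces the term $u_j^n$ in the time derivative term by its average (Figure 19.1.2):

$$
u_j^n \rightarrow \frac{1}{2}\left(u_{j+1}^n + u_{j-1}^n\right)
\tag{19.1.14}
$$

This turns (19.1.11) into

$$
u_j^{n+1} = \frac{1}{2}\left(u_{j+1}^n + u_{j-1}^n\right) - \frac{v\Delta t}{2\Delta x}\left(u_{j+1}^n - u_{j-1}^n\right)
\tag{19.1.15}
$$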

Figure 19.1.3. Courant condition for stability of a differencing scheme. The solution of a hyperbolic
problem at a point depends on information within some domain of dependency to the past, shown here 
shaded. The differencing scheme (19.1.15) has its own domain of dependency determined by the choice 
of points on one time slice (shown as connected solid dots) whose values are used in determining a new 
point (shown connected by dashed lines). A differencing scheme is Courant stable if the differencing 
domain of dependency is larger than that of the PDEs, as in (a), and unstable if the relationship is the 
reverse, as in (b). For more complicated differencing schemes, the domain of dependency might not be 
determined simply by the outermost points. 

Substituting equation (19.1.12), we find for the amplification factor

$$
\xi = \cos k\Delta x - i\frac{v\Delta t}{\Delta x}\sin k\Delta x
\tag{19.1.16}
$$

The stability condition $|\xi|^2 \le 1$ leads to the requirement

$$
\frac{|v|\Delta t}{\Delta x} \le 1
\tag{19.1.17}
$$

This is the famous Courant-Friedrichs-Lewy stability criterion, often
called simply the Courant condition. Intuitively, the stability condition can be
understood as follows (Figure 19.1.3): The quantity $u_j^{n+1}$ in equation (19.1.15) is
computed from information at points $j-1$ and $j+1$ at time $n$. In other words,
$x_{j-1}$ and $x_{j+1}$ are the boundaries of the spatial region that is allowed to communicate
information to $u_j^{n+1}$. Now recall that in the continuum wave equation, information
actually propagates with a maximum velocity $v$. If the point $u_j^{n+1}$ is outside of
the shaded region in Figure 19.1.3, then it requires information from points more
distant than the differencing scheme allows. Lack of that information gives rise to
an instability. Therefore, $\Delta t$ cannot be made too large.
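
A hedged C sketch of the Lax update (19.1.15), together with a check of the Courant condition, might look as follows. The names `lax_step` and `courant_ok` and the periodic boundary treatment are assumptions of this sketch.

```c
#include <stddef.h>

/* Returns 1 if |v|*dt <= dx (the Courant condition), 0 otherwise. */
int courant_ok(double v, double dt, double dx)
{
    double speed = (v < 0.0) ? -v : v;
    return speed * dt <= dx;
}

/* One Lax step (19.1.15) on a periodic grid of n points.
   alpha = v*dt/dx. u_old and u_new must not overlap. */
void lax_step(const double *u_old, double *u_new, size_t n, double alpha)
{
    for (size_t j = 0; j < n; j++) {
        size_t jp = (j + 1) % n;          /* j+1 with periodic wraparound */
        size_t jm = (j + n - 1) % n;      /* j-1 with periodic wraparound */
        u_new[j] = 0.5 * (u_old[jp] + u_old[jm])
                 - 0.5 * alpha * (u_old[jp] - u_old[jm]);
    }
}
```

At the stability limit alpha = 1 the update reduces to u_new[j] = u_old[j-1], so the profile is shifted by exactly one cell per step, which is the analytic solution.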

The surprising result, that the simple replacement (19.1.14) stabilizes the FTCS 
scheme, is our first encounter with the fact that differencing PDEs is an art as much 
as a science. To see if we can demystify the art somewhat, let us compare the FTCS 
and Lax schemes by rewriting equation (19.1.15) so that it is in the form of equation 
(19.1.11) with a remainder term:

$$
\frac{u_j^{n+1} - u_j^n}{\Delta t} = -v\left(\frac{u_{j+1}^n - u_{j-1}^n}{2\Delta x}\right)
+ \frac{1}{2}\left(\frac{u_{j+1}^n - 2u_j^n + u_{j-1}^n}{\Delta t}\right)
\tag{19.1.18}
$$

But this is exactly the FTCS representation of the equation

$$
\frac{\partial u}{\partial t} = -v\frac{\partial u}{\partial x} + \frac{(\Delta x)^2}{2\Delta t}\nabla^2 u
\tag{19.1.19}
$$

where $\nabla^2 = \partial^2/\partial x^2$ in one dimension. We have, in effect, added a diffusion term to
the equation, or, if you recall the form of the Navier-Stokes equation for viscous fluid
flow, a dissipative term. The Lax scheme is thus said to have numerical dissipation,
or numerical viscosity. We can see this also in the amplification factor. Unless $|v|\Delta t$
is exactly equal to $\Delta x$, $|\xi| < 1$ and the amplitude of the wave decreases spuriously.

Isn’t a spurious decrease as bad as a spurious increase? No. The scales that we 
hope to study accurately are those that encompass many grid points, so that they have
$k\Delta x \ll 1$. (The spatial wave number $k$ is defined by equation 19.1.12.) For these
scales, the amplification factor can be seen to be very close to one, in both the stable
and unstable schemes. The stable and unstable schemes are therefore about equally
accurate. For the unstable scheme, however, short scales with $k\Delta x \sim 1$, which we
are not interested in, will blow up and swamp the interesting part of the solution. 
Much better to have a stable scheme in which these short wavelengths die away 
innocuously. Both the stable and the unstable schemes are inaccurate for these short 
wavelengths, but the inaccuracy is of a tolerable character when the scheme is stable. 

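When the dependent variable $\mathbf{u}$ is a vector, then the von Neumann analysis
is slightly more complicated. For example, we can consider equation (19.1.3),
rewritten as

$$
\frac{\partial}{\partial t}\begin{bmatrix} r \\ s \end{bmatrix}
= \frac{\partial}{\partial x}\begin{bmatrix} vs \\ vr \end{bmatrix}
\tag{19.1.20}
$$

The Lax method for this equation is

$$
\begin{aligned}
r_j^{n+1} &= \frac{1}{2}\left(r_{j+1}^n + r_{j-1}^n\right) + \frac{v\Delta t}{2\Delta x}\left(s_{j+1}^n - s_{j-1}^n\right) \\
s_j^{n+1} &= \frac{1}{2}\left(s_{j+1}^n + s_{j-1}^n\right) + \frac{v\Delta t}{2\Delta x}\left(r_{j+1}^n - r_{j-1}^n\right)
\end{aligned}
\tag{19.1.21}
$$

The von Neumann stability analysis now proceeds by assuming that the eigenmode
is of the following (vector) form,

$$
\begin{bmatrix} r_j^n \\ s_j^n \end{bmatrix} = \xi^n e^{ikj\Delta x}\begin{bmatrix} r^0 \\ s^0 \end{bmatrix}
\tag{19.1.22}
$$

Here the vector on the right-hand side is a constant (both in space and in time)
eigenvector, and $\xi$ is a complex number, as before. Substituting (19.1.22) into
(19.1.21), and dividing by the power $\xi^n$, gives the homogeneous vector equation

$$
\begin{bmatrix}
(\cos k\Delta x) - \xi & i\dfrac{v\Delta t}{\Delta x}\sin k\Delta x \\
i\dfrac{v\Delta t}{\Delta x}\sin k\Delta x & (\cos k\Delta x) - \xi
\end{bmatrix}
\cdot
\begin{bmatrix} r^0 \\ s^0 \end{bmatrix}
=
\begin{bmatrix} 0 \\ 0 \end{bmatrix}
\tag{19.1.23}
$$

This admits a solution only if the determinant of the matrix on the left vanishes, a
condition easily shown to yield the two roots $\xi$

$$
\xi = \cos k\Delta x \pm i\frac{v\Delta t}{\Delta x}\sin k\Delta x
\tag{19.1.24}
$$

The stability condition is that both roots satisfy $|\xi| \le 1$. This again turns out to be
simply the Courant condition (19.1.17).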
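
The "easily shown" step can be spelled out as follows, writing $\alpha$ as a local shorthand (an abbreviation introduced here) for $v\Delta t/\Delta x$. Setting the determinant of the $2 \times 2$ matrix to zero gives

$$
\left(\cos k\Delta x - \xi\right)^2 - \left(i\alpha \sin k\Delta x\right)^2
= \left(\cos k\Delta x - \xi\right)^2 + \alpha^2 \sin^2 k\Delta x = 0
$$

so that $\cos k\Delta x - \xi = \pm i\alpha \sin k\Delta x$. Both roots have the same modulus,

$$
|\xi|^2 = \cos^2 k\Delta x + \alpha^2 \sin^2 k\Delta x = 1 - (1 - \alpha^2)\sin^2 k\Delta x
$$

which is at most 1 for every $k$ exactly when $\alpha^2 \le 1$.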

Other Varieties of Error


Thus far we have been concerned with amplitude error, because of its intimate 
connection with the stability or instability of a differencing scheme. Other varieties 
of error are relevant when we shift our concern to accuracy, rather than stability. 

Finite-difference schemes for hyperbolic equations can exhibit dispersion, or 
phase errors. For example, equation (19.1.16) can be rewritten as 

$$
\xi = e^{-ik\Delta x} + i\left(1 - \frac{v\Delta t}{\Delta x}\right)\sin k\Delta x
\tag{19.1.25}
$$

An arbitrary initial wave packet is a superposition of modes with different $k$’s.
At each timestep the modes get multiplied by different phase factors (19.1.25),
depending on their value of $k$. If $\Delta t = \Delta x/v$, then the exact solution for each mode
of a wave packet $f(x - vt)$ is obtained if each mode gets multiplied by $\exp(-ik\Delta x)$.
For this value of $\Delta t$, equation (19.1.25) shows that the finite-difference solution
gives the exact analytic result. However, if $v\Delta t/\Delta x$ is not exactly 1, the phase
relations of the modes can become hopelessly garbled and the wave packet disperses.
Note from (19.1.25) that the dispersion becomes large as soon as the wavelength
becomes comparable to the grid spacing $\Delta x$.

A third type of error is one associated with nonlinear hyperbolic equations and 
is therefore sometimes called nonlinear instability. For example, a piece of the Euler 
or Navier-Stokes equations for fluid flow looks like 


$$
\frac{\partial v}{\partial t} = -v\frac{\partial v}{\partial x} + \ldots
\tag{19.1.26}
$$


The nonlinear term in v can cause a transfer of energy in Fourier space from 
long wavelengths to short wavelengths. This results in a wave profile steepening 
until a vertical profile or “shock” develops. Since the von Neumann analysis 
suggests that the stability can depend on $k\Delta x$, a scheme that was stable for shallow
profiles can become unstable for steep profiles. This kind of difficulty arises in
a differencing scheme where the cascade in Fourier space is halted at the shortest
wavelength representable on the grid, that is, at $k \sim 1/\Delta x$. If energy simply
accumulates in these modes, it eventually swamps the energy in the long wavelength 
modes of interest. 

Nonlinear instability and shock formation are thus somewhat controlled by
numerical viscosity such as that discussed in connection with equation (19.1.18) 
above. In some fluid problems, however, shock formation is not merely an annoyance, 
but an actual physical behavior of the fluid whose detailed study is a goal. Then, 
numerical viscosity alone may not be adequate or sufficiently controllable. This is a 
complicated subject which we discuss further in the subsection on fluid dynamics, 
below. 

For wave equations, propagation errors (amplitude or phase) are usually most 
worrisome. For advective equations, on the other hand, transport errors are usually 
of greater concern. In the Lax scheme, equation (19.1.15), a disturbance in the 
advected quantity $u$ at mesh point $j$ propagates to mesh points $j+1$ and $j-1$ at
the next timestep. In reality, however, if the velocity v is positive then only mesh 
point j + 1 should be affected. 

Figure 19.1.4. Representation of upwind differencing schemes. The upper scheme is stable when the
advection constant v is negative, as shown; the lower scheme is stable when the advection constant v is 
positive, also as shown. The Courant condition must, of course, also be satisfied. 


The simplest way to model the transport properties “better” is to use upwind 
differencing (see Figure 19.1.4): 


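$$
\frac{u_j^{n+1} - u_j^n}{\Delta t} = -v_j^n
\begin{cases}
\dfrac{u_j^n - u_{j-1}^n}{\Delta x}, & v_j^n > 0 \\[2ex]
\dfrac{u_{j+1}^n - u_j^n}{\Delta x}, & v_j^n < 0
\end{cases}
\tag{19.1.27}
$$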



Note that this scheme is only first-order, not second-order, accurate in the 
calculation of the spatial derivatives. How can it be “better”? The answer is 
one that annoys the mathematicians: The goal of numerical simulations is not 
always “accuracy” in a strictly mathematical sense, but sometimes “fidelity” to the 
underlying physics in a sense that is looser and more pragmatic. In such contexts, 
some kinds of error are much more tolerable than others. Upwind differencing 
generally adds fidelity to problems where the advected variables are liable to undergo 
sudden changes of state, e.g., as they pass through shocks or other discontinuities. 
You will have to be guided by the specific nature of your own problem. 

For the differencing scheme (19.1.27), the amplification factor (for constant v) is 



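$$
\xi = 1 - \left|\frac{v\Delta t}{\Delta x}\right|(1 - \cos k\Delta x) - i\frac{v\Delta t}{\Delta x}\sin k\Delta x
\tag{19.1.28}
$$

$$
|\xi|^2 = 1 - 2\left|\frac{v\Delta t}{\Delta x}\right|\left(1 - \left|\frac{v\Delta t}{\Delta x}\right|\right)(1 - \cos k\Delta x)
\tag{19.1.29}
$$

So the stability criterion $|\xi|^2 \le 1$ is (again) simply the Courant condition (19.1.17).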
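
As a hedged illustration of upwind differencing for the constant-velocity case, the C sketch below picks the one-sided difference according to the sign of the advection velocity. The name `upwind_step`, the periodic wraparound, and the restriction to constant v are assumptions of this sketch.

```c
#include <stddef.h>

/* One first-order upwind step for du/dt = -v du/dx with constant v,
   on a periodic grid of n points. alpha = v*dt/dx (its sign carries the
   direction of the flow). u_old and u_new must not overlap. */
void upwind_step(const double *u_old, double *u_new, size_t n, double alpha)
{
    for (size_t j = 0; j < n; j++) {
        size_t jp = (j + 1) % n;          /* j+1 with periodic wraparound */
        size_t jm = (j + n - 1) % n;      /* j-1 with periodic wraparound */
        if (alpha > 0.0) {
            /* flow to the right: information comes from j-1 */
            u_new[j] = u_old[j] - alpha * (u_old[j] - u_old[jm]);
        } else {
            /* flow to the left (or zero): information comes from j+1 */
            u_new[j] = u_old[j] - alpha * (u_old[jp] - u_old[j]);
        }
    }
}
```

With a single nonzero cell and alpha = 0.5, one step leaves half of the pulse in place and moves half to the downstream neighbor; the upstream neighbor is untouched, unlike in the Lax scheme.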

There are various ways of improving the accuracy of first-order upwind
differencing. In the continuum equation, material originally a distance $v\Delta t$ away
arrives at a given point after a time interval $\Delta t$. In the first-order method, the
material always arrives from $\Delta x$ away. If $v\Delta t \ll \Delta x$ (to ensure accuracy), this can
cause a large error. One way of reducing this error is to interpolate $u$ between $j-1$
and $j$ before transporting it. This gives effectively a second-order method. Various
schemes for second-order upwind differencing are discussed and compared in [2-3].

Figure 19.1.5. Representation of the staggered leapfrog differencing scheme. Note that information
from two previous time slices is used in obtaining the desired point. This scheme is second-order
accurate in both space and time.

Second-Order Accuracy in Time 


When using a method that is first-order accurate in time but second-order
accurate in space, one generally has to take $v\Delta t$ significantly smaller than $\Delta x$ to
achieve desired accuracy, say, by at least a factor of 5. Thus the Courant condition
is not actually the limiting factor with such schemes in practice. However, there are 
schemes that are second-order accurate in both space and time, and these can often be 
pushed right to their stability limit, with correspondingly smaller computation times. 

For example, the staggered leapfrog method for the conservation equation 
(19.1.1) is defined as follows (Figure 19.1.5): Using the values of $u^n$ at time $t_n$,
compute the fluxes $F_j^n$. Then compute new values $u^{n+1}$ using the time-centered
values of the fluxes:

$$
u_j^{n+1} - u_j^{n-1} = -\frac{\Delta t}{\Delta x}\left(F_{j+1}^n - F_{j-1}^n\right)
\tag{19.1.30}
$$

The name comes from the fact that the time levels in the time derivative term
“leapfrog” over the time levels in the space derivative term. The method requires
that $u^{n-1}$ and $u^n$ be stored to compute $u^{n+1}$.

For our simple model equation (19.1.6), staggered leapfrog takes the form

$$
u_j^{n+1} - u_j^{n-1} = -\frac{v\Delta t}{\Delta x}\left(u_{j+1}^n - u_{j-1}^n\right)
\tag{19.1.31}
$$



The von Neumann stability analysis now gives a quadratic equation for $\xi$, rather than
a linear one, because of the occurrence of three consecutive powers of $\xi$ when the
form (19.1.12) for an eigenmode is substituted into equation (19.1.31),

$$
\xi^2 - 1 = -2i\xi\frac{v\Delta t}{\Delta x}\sin k\Delta x
\tag{19.1.32}
$$

whose solution is

$$
\xi = -i\frac{v\Delta t}{\Delta x}\sin k\Delta x \pm \sqrt{1 - \left(\frac{v\Delta t}{\Delta x}\sin k\Delta x\right)^2}
\tag{19.1.33}
$$

Thus the Courant condition is again required for stability. In fact, in equation
(19.1.33), $|\xi|^2 = 1$ for any $v\Delta t \le \Delta x$. This is the great advantage of the staggered
leapfrog method: There is no amplitude dissipation.
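
A hedged C sketch of the leapfrog update (19.1.31) shows the extra storage the scheme needs: two earlier time levels instead of one. The name `leapfrog_step` and the periodic boundaries are assumptions of this sketch; how the very first step is started (only one level is known at the initial time) is left to the caller.

```c
#include <stddef.h>

/* One staggered leapfrog step (19.1.31) on a periodic grid of n points.
   u_prev holds level n-1, u_curr holds level n, u_next receives level n+1.
   alpha = v*dt/dx. u_next must not overlap the input arrays. */
void leapfrog_step(const double *u_prev, const double *u_curr,
                   double *u_next, size_t n, double alpha)
{
    for (size_t j = 0; j < n; j++) {
        size_t jp = (j + 1) % n;          /* j+1 with periodic wraparound */
        size_t jm = (j + n - 1) % n;      /* j-1 with periodic wraparound */
        u_next[j] = u_prev[j] - alpha * (u_curr[jp] - u_curr[jm]);
    }
}
```

If the two stored levels are an exact one-cell shift of each other and alpha = 1, the new level is again an exact shift, with no loss of amplitude.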

Staggered leapfrog differencing of equations like (19.1.20) is most transparent 
if the variables are centered on appropriate half-mesh points: 


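$$
\begin{aligned}
r_{j+1/2}^n &\equiv v\left.\frac{\partial u}{\partial x}\right|_{j+1/2}^n = v\,\frac{u_{j+1}^n - u_j^n}{\Delta x} \\
s_j^{n+1/2} &\equiv \left.\frac{\partial u}{\partial t}\right|_j^{n+1/2} = \frac{u_j^{n+1} - u_j^n}{\Delta t}
\end{aligned}
\tag{19.1.34}
$$

This is purely a notational convenience: we can think of the mesh on which $r$ and
$s$ are defined as being twice as fine as the mesh on which the original variable $u$ is
defined. The leapfrog differencing of equation (19.1.20) is

$$
\begin{aligned}
\frac{r_{j+1/2}^{n+1} - r_{j+1/2}^n}{\Delta t} &= v\,\frac{s_{j+1}^{n+1/2} - s_j^{n+1/2}}{\Delta x} \\
\frac{s_j^{n+1/2} - s_j^{n-1/2}}{\Delta t} &= v\,\frac{r_{j+1/2}^n - r_{j-1/2}^n}{\Delta x}
\end{aligned}
\tag{19.1.35}
$$

If you substitute equation (19.1.22) in equation (19.1.35), you will find that once
again the Courant condition is required for stability, and that there is no amplitude
dissipation when it is satisfied.

If we substitute equation (19.1.34) in equation (19.1.35), we find that equation
(19.1.35) is equivalent to

$$
\frac{u_j^{n+1} - 2u_j^n + u_j^{n-1}}{(\Delta t)^2} = v^2\,\frac{u_{j+1}^n - 2u_j^n + u_{j-1}^n}{(\Delta x)^2}
\tag{19.1.36}
$$

This is just the “usual” second-order differencing of the wave equation (19.1.2). We
see that it is a two-level scheme, requiring both $u^n$ and $u^{n-1}$ to obtain $u^{n+1}$. In
equation (19.1.35) this shows up as both $s^{n-1/2}$ and $r^n$ being needed to advance
the solution.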
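
Solved for the new time level, the second-order differencing of the wave equation becomes a one-line update. The C sketch below is a hedged illustration; the name `wave_step`, the parameter `c2` (the square of v*dt/dx), and the periodic boundaries are assumptions.

```c
#include <stddef.h>

/* One step of the standard second-order differencing of the wave equation
   u_tt = v^2 u_xx on a periodic grid of n points.
   u_prev holds level n-1, u_curr level n, u_next receives level n+1.
   c2 = (v*dt/dx)^2. u_next must not overlap the input arrays. */
void wave_step(const double *u_prev, const double *u_curr,
               double *u_next, size_t n, double c2)
{
    for (size_t j = 0; j < n; j++) {
        size_t jp = (j + 1) % n;          /* j+1 with periodic wraparound */
        size_t jm = (j + n - 1) % n;      /* j-1 with periodic wraparound */
        u_next[j] = 2.0 * u_curr[j] - u_prev[j]
                  + c2 * (u_curr[jp] - 2.0 * u_curr[j] + u_curr[jm]);
    }
}
```

With c2 = 1 the update collapses to u_next[j] = u_curr[j+1] + u_curr[j-1] - u_prev[j], which propagates a right-moving (or left-moving) shifted profile exactly.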

For equations more complicated than our simple model equation, especially
nonlinear equations, the leapfrog method usually becomes unstable when the
gradients get large. The instability is related to the fact that odd and even mesh points are
completely decoupled, like the black and white squares of a chess board, as shown
in Figure 19.1.6. This mesh drifting instability is cured by coupling the two meshes
through a numerical viscosity term, e.g., adding to the right side of (19.1.31) a small
coefficient ($\ll 1$) times $u_{j+1}^n - 2u_j^n + u_{j-1}^n$. For more on stabilizing difference
schemes by adding numerical dissipation, see, e.g., [4].

Figure 19.1.6. Origin of mesh-drift instabilities in a staggered leapfrog scheme. If the mesh points
are imagined to lie in the squares of a chess board, then white squares couple to themselves, black to
themselves, but there is no coupling between white and black. The fix is to introduce a small diffusive
mesh-coupling piece.

The Two-Step Lax-Wendroff scheme is a second-order in time method that 
avoids large numerical dissipation and mesh drifting. One defines intermediate 
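values $u_{j+1/2}^{n+1/2}$ at the half timesteps $t_{n+1/2}$ and the half mesh points $x_{j+1/2}$. These
are calculated by the Lax scheme:

$$
u_{j+1/2}^{n+1/2} = \frac{1}{2}\left(u_{j+1}^n + u_j^n\right) - \frac{\Delta t}{2\Delta x}\left(F_{j+1}^n - F_j^n\right)
\tag{19.1.37}
$$

Using these variables, one calculates the fluxes $F_{j+1/2}^{n+1/2}$. Then the updated values
$u_j^{n+1}$ are calculated by the properly centered expression

$$
u_j^{n+1} = u_j^n - \frac{\Delta t}{\Delta x}\left(F_{j+1/2}^{n+1/2} - F_{j-1/2}^{n+1/2}\right)
\tag{19.1.38}
$$

The provisional values $u_{j+1/2}^{n+1/2}$ are now discarded. (See Figure 19.1.7.)

Let us investigate the stability of this method for our model advective equation,
where $F = vu$. Substitute (19.1.37) in (19.1.38) to get

$$
u_j^{n+1} = u_j^n - \alpha\left[\frac{1}{2}\left(u_{j+1}^n + u_j^n\right) - \frac{1}{2}\alpha\left(u_{j+1}^n - u_j^n\right)
- \frac{1}{2}\left(u_j^n + u_{j-1}^n\right) + \frac{1}{2}\alpha\left(u_j^n - u_{j-1}^n\right)\right]
\tag{19.1.39}
$$

where

$$
\alpha \equiv \frac{v\Delta t}{\Delta x}
\tag{19.1.40}
$$

Then

$$
\xi = 1 - i\alpha\sin k\Delta x - \alpha^2(1 - \cos k\Delta x)
\tag{19.1.41}
$$

so

$$
|\xi|^2 = 1 - \alpha^2(1 - \alpha^2)(1 - \cos k\Delta x)^2
\tag{19.1.42}
$$

The stability criterion $|\xi|^2 \le 1$ is therefore $\alpha^2 \le 1$, or $v\Delta t \le \Delta x$ as usual.
Incidentally, you should not think that the Courant condition is the only stability
requirement that ever turns up in PDEs. It keeps doing so in our model examples
just because those examples are so simple in form. The method of analysis is,
however, general.

Except when $\alpha = 1$, $|\xi|^2 < 1$ in (19.1.42), so some amplitude damping does
occur. The effect is relatively small, however, for wavelengths large compared with
the mesh size $\Delta x$. If we expand (19.1.42) for small $k\Delta x$, we find

$$
|\xi|^2 = 1 - \alpha^2(1 - \alpha^2)\frac{(k\Delta x)^4}{4} + \ldots
\tag{19.1.43}
$$

The departure from unity occurs only at fourth order in $k$. This should be contrasted
with equation (19.1.16) for the Lax method, which shows that

$$
|\xi|^2 = 1 - (1 - \alpha^2)(k\Delta x)^2 + \ldots
\tag{19.1.44}
$$

for small $k\Delta x$.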
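
For the model flux F = vu, the two steps of the Two-Step Lax-Wendroff scheme can be sketched in C as below. This is a hedged illustration only: the name `lax_wendroff_step`, the scratch array for the half-step fluxes, the periodic boundaries, and the integer return code are all assumptions of this sketch.

```c
#include <stddef.h>
#include <stdlib.h>

/* One Two-Step Lax-Wendroff step for du/dt = -d(v*u)/dx, constant v,
   on a periodic grid of n points. Returns 0 on success, -1 if the
   scratch array cannot be allocated. u_old and u_new must not overlap. */
int lax_wendroff_step(const double *u_old, double *u_new, size_t n,
                      double v, double dt, double dx)
{
    double ratio = dt / dx;
    double *f_half = malloc(n * sizeof *f_half);  /* flux at (j+1/2, n+1/2) */
    if (f_half == NULL)
        return -1;

    /* Step 1: provisional values at half mesh points by the Lax scheme. */
    for (size_t j = 0; j < n; j++) {
        size_t jp = (j + 1) % n;
        double u_half = 0.5 * (u_old[jp] + u_old[j])
                      - 0.5 * ratio * (v * u_old[jp] - v * u_old[j]);
        f_half[j] = v * u_half;
    }

    /* Step 2: properly centered update using the half-step fluxes. */
    for (size_t j = 0; j < n; j++) {
        size_t jm = (j + n - 1) % n;
        u_new[j] = u_old[j] - ratio * (f_half[j] - f_half[jm]);
    }

    free(f_half);                         /* provisional values discarded */
    return 0;
}
```

For a nonlinear flux, only the two lines that evaluate `v * u` would change; the structure of the two passes stays the same.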

In summary, our recommendation for initial value problems that can be cast in
flux-conservative form, and especially problems related to the wave equation, is to use 
the staggered leapfrog method when possible. We have personally had better success 
with it than with the Two-Step Lax-Wendroff method. For problems sensitive to 
transport errors, upwind differencing or one of its refinements should be considered. 

Fluid Dynamics with Shocks 

As we alluded to earlier, the treatment of fluid dynamics problems with shocks 
has become a very complicated and very sophisticated subject. All we can attempt 
to do here is to guide you to some starting points in the literature. 

There are basically three important general methods for handling shocks. The 
oldest and simplest method, invented by von Neumann and Richtmyer, is to add 
artificial viscosity to the equations, modeling the way Nature uses real viscosity 
to smooth discontinuities. A good starting point for trying out this method is the 
differencing scheme in §12.11 of [1]. This scheme is excellent for nearly all problems
in one spatial dimension. 

The second method combines a high-order differencing scheme that is accurate 
for smooth flows with a low order scheme that is very dissipative and can smooth 
the shocks. Typically, various upwind differencing schemes are combined using 
weights chosen to zero the low order scheme unless steep gradients are present, and 
also chosen to enforce various “monotonicity” constraints that prevent nonphysical 
oscillations from appearing in the numerical solution. References [2-3,5] are a good 
place to start with these methods. 

The third, and potentially most powerful method, is Godunov’s approach. Here 
one gives up the simple linearization inherent in finite differencing based on Taylor 
series and includes the nonlinearity of the equations explicitly. There is an analytic 
solution for the evolution of two uniform states of a fluid separated by a discontinuity, 
the Riemann shock problem. Godunov’s idea was to approximate the fluid by a 
large number of cells of uniform states, and piece them together using the Riemann 
solution. There have been many generalizations of Godunov’s approach, of which 
the most powerful is probably the PPM method [6].

Readable reviews of all these methods, discussing the difficulties arising when 
one-dimensional methods are generalized to multidimensions, are given in [7-9].

19.2 Diffusive Initial Value Problems


Recall the model parabolic equation, the diffusion equation in one space 
dimension, 


$$
\frac{\partial u}{\partial t} = \frac{\partial}{\partial x}\left(D\frac{\partial u}{\partial x}\right)
\tag{19.2.1}
$$

where $D$ is the diffusion coefficient. Actually, this equation is a flux-conservative
equation of the form considered in the previous section, with

$$
F = -D\frac{\partial u}{\partial x}
\tag{19.2.2}
$$

the flux in the $x$-direction. We will assume $D \ge 0$, otherwise equation (19.2.1) has
physically unstable solutions: A small disturbance evolves to become more and more 
concentrated instead of dispersing. (Don’t make the mistake of trying to find a stable 
differencing scheme for a problem whose underlying PDEs are themselves unstable!) 

Even though (19.2.1) is of the form already considered, it is useful to consider 
it as a model in its own right. The particular form of flux (19.2.2), and its direct 
generalizations, occur quite frequently in practice. Moreover, we have already seen 
that numerical viscosity and artificial viscosity can introduce diffusive pieces like 
the right-hand side of (19.2.1) in many other situations. 

Consider first the case when D is a constant. Then the equation 


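$$ \frac{\partial u}{\partial t} = D\,\frac{\partial^2 u}{\partial x^2} \tag{19.2.3} $$

can be differenced in the obvious way:

$$ \frac{u_j^{n+1} - u_j^n}{\Delta t} = D \left[ \frac{u_{j+1}^n - 2u_j^n + u_{j-1}^n}{(\Delta x)^2} \right] \tag{19.2.4} $$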


This is the FTCS scheme again, except that it is a second derivative that has been 
differenced on the right-hand side. But this makes a world of difference! The FTCS 
scheme was unstable for the hyperbolic equation; however, a quick calculation shows 
that the amplification factor for equation (19.2.4) is 


$$ \xi = 1 - \frac{4D\,\Delta t}{(\Delta x)^2} \sin^2\left( \frac{k\,\Delta x}{2} \right) \tag{19.2.5} $$


The requirement $|\xi| \le 1$ leads to the stability criterion

$$ \frac{2D\,\Delta t}{(\Delta x)^2} \le 1 \tag{19.2.6} $$

The physical interpretation of the restriction (19.2.6) is that the maximum
allowed timestep is, up to a numerical factor, the diffusion time across a cell of
width $\Delta x$.
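The criterion (19.2.6) can be checked numerically by scanning the amplification factor (19.2.5) over wavenumbers. The following is only a minimal sketch; the function names and the sampling of $k\,\Delta x$ on [0, π] are illustrative assumptions, not part of the text.

```python
import math

def ftcs_amplification(alpha, k_dx):
    """Amplification factor (19.2.5), with alpha = D*dt/dx**2 and k_dx = k*dx."""
    return 1.0 - 4.0 * alpha * math.sin(0.5 * k_dx) ** 2

def max_abs_xi(alpha, samples=181):
    """Largest |xi| over k*dx sampled uniformly on [0, pi] (hypothetical helper)."""
    return max(abs(ftcs_amplification(alpha, math.pi * i / (samples - 1)))
               for i in range(samples))

marginal = max_abs_xi(0.5)   # 2*D*dt/dx**2 = 1.0: the limit allowed by (19.2.6)
too_big = max_abs_xi(0.6)    # 2*D*dt/dx**2 = 1.2: the shortest waves grow
```

With `alpha = 0.6` the worst mode is the shortest resolvable wavelength, $k\,\Delta x = \pi$, which is why a violation of (19.2.6) first shows up as grid-scale oscillation.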

More generally, the diffusion time $\tau$ across a spatial scale of size $\lambda$ is of order

$$ \tau \sim \frac{\lambda^2}{D} \tag{19.2.7} $$

Usually we are interested in modeling accurately the evolution of features with
spatial scales $\lambda \gg \Delta x$. If we are limited to timesteps satisfying (19.2.6), we will
need to evolve through of order $\lambda^2/(\Delta x)^2$ steps before things start to happen on the
scale of interest. This number of steps is usually prohibitive. We must therefore
find a stable way of taking timesteps comparable to, or perhaps (for accuracy)
somewhat smaller than, the time scale of (19.2.7).

This goal poses an immediate “philosophical” question. Obviously the large 
timesteps that we propose to take are going to be woefully inaccurate for the small 
scales that we have decided not to be interested in. We want those scales to do 
something stable, “innocuous,” and perhaps not too physically unreasonable. We 
want to build this innocuous behavior into our differencing scheme. What should 
it be? 

There are two different answers, each of which has its pros and cons. The 
first answer is to seek a differencing scheme that drives small-scale features to their 
equilibrium forms, e.g., satisfying equation (19.2.3) with the left-hand side set to 
zero. This answer generally makes the best physical sense; but, as we will see, it leads 
to a differencing scheme (“fully implicit”) that is only first-order accurate in time for 
the scales that we are interested in. The second answer is to let small-scale features 
maintain their initial amplitudes, so that the evolution of the larger-scale features 
of interest takes place superposed with a kind of “frozen in” (though fluctuating) 
background of small-scale stuff. This answer gives a differencing scheme (“Crank- 
Nicolson”) that is second-order accurate in time. Toward the end of an evolution 
calculation, however, one might want to switch over to some steps of the other kind, 
to drive the small-scale stuff into equilibrium. Let us now see where these distinct 
differencing schemes come from: 

Consider the following differencing of (19.2.3), 


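$$ \frac{u_j^{n+1} - u_j^n}{\Delta t} = D \left[ \frac{u_{j+1}^{n+1} - 2u_j^{n+1} + u_{j-1}^{n+1}}{(\Delta x)^2} \right] \tag{19.2.8} $$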


This is exactly like the FTCS scheme (19.2.4), except that the spatial derivatives on 
the right-hand side are evaluated at timestep n + 1. Schemes with this character are 
called fully implicit or backward time, by contrast with FTCS (which is called fully 
explicit). To solve equation (19.2.8) one has to solve a set of simultaneous linear 
equations at each timestep for the $u_j^{n+1}$. Fortunately, this is a simple problem because
the system is tridiagonal: Just group the terms in equation (19.2.8) appropriately:

$$ -\alpha u_{j-1}^{n+1} + (1 + 2\alpha) u_j^{n+1} - \alpha u_{j+1}^{n+1} = u_j^n, \qquad j = 1, 2, \ldots, J - 1 \tag{19.2.9} $$

where

$$ \alpha \equiv \frac{D\,\Delta t}{(\Delta x)^2} \tag{19.2.10} $$
Supplemented by Dirichlet or Neumann boundary conditions at j = 0 and j = J,
equation (19.2.9) is clearly a tridiagonal system, which can easily be solved at each 
timestep by the method of §2.4. 
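As an illustration of how (19.2.9) turns into a tridiagonal solve, here is a minimal sketch in Python. It assumes Dirichlet conditions (the end values `u[0]` and `u[J]` are held fixed), and `solve_tridiagonal` is a plain elimination routine written for this example; it stands in for, and is not, the routine of §2.4.

```python
def solve_tridiagonal(a, b, c, r):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = r[i] by forward
    elimination and back-substitution (no pivoting; a[0], c[-1] unused)."""
    n = len(r)
    cp, rp = [0.0] * n, [0.0] * n
    cp[0], rp[0] = c[0] / b[0], r[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        rp[i] = (r[i] - a[i] * rp[i - 1]) / denom
    x = rp[:]
    for i in range(n - 2, -1, -1):
        x[i] = rp[i] - cp[i] * x[i + 1]
    return x

def implicit_step(u, alpha):
    """One fully implicit step (19.2.9); u[0] and u[J] are fixed (assumed Dirichlet)."""
    J = len(u) - 1
    n = J - 1                        # unknowns are j = 1 .. J-1
    r = list(u[1:J])
    r[0] += alpha * u[0]             # known boundary values move to the right-hand side
    r[-1] += alpha * u[J]
    inner = solve_tridiagonal([-alpha] * n, [1.0 + 2.0 * alpha] * n, [-alpha] * n, r)
    return [u[0]] + inner + [u[J]]
```

Calling `implicit_step` with a very large `alpha` drives any interior profile toward the straight line joining the two boundary values, which is the equilibrium behavior discussed next.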

What is the behavior of (19.2.8) for very large timesteps? The answer is seen 
most clearly in (19.2.9), in the limit $\alpha \to \infty$ ($\Delta t \to \infty$). Dividing by $\alpha$, we see that
the difference equations are just the finite-difference form of the equilibrium equation

$$ \frac{\partial^2 u}{\partial x^2} = 0 \tag{19.2.11} $$


What about stability? The amplification factor for equation (19.2.8) is 


$$ \xi = \frac{1}{1 + 4\alpha \sin^2\left( \dfrac{k\,\Delta x}{2} \right)} \tag{19.2.12} $$

Clearly $|\xi| < 1$ for any stepsize $\Delta t$. The scheme is unconditionally stable. The details
of the small-scale evolution from the initial conditions are obviously inaccurate for
large $\Delta t$. But, as advertised, the correct equilibrium solution is obtained. This is
the characteristic feature of implicit methods. 

Here, on the other hand, is how one gets to the second of our above philosophical 
answers, combining the stability of an implicit method with the accuracy of a method 
that is second-order in both space and time. Simply form the average of the explicit 
and implicit FTCS schemes: 


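$$ \frac{u_j^{n+1} - u_j^n}{\Delta t} = \frac{D}{2} \left[ \frac{(u_{j+1}^{n+1} - 2u_j^{n+1} + u_{j-1}^{n+1}) + (u_{j+1}^n - 2u_j^n + u_{j-1}^n)}{(\Delta x)^2} \right] \tag{19.2.13} $$

Here both the left- and right-hand sides are centered at timestep $n + \frac{1}{2}$, so the method
is second-order accurate in time as claimed. The amplification factor is

$$ \xi = \frac{1 - 2\alpha \sin^2\left( \dfrac{k\,\Delta x}{2} \right)}{1 + 2\alpha \sin^2\left( \dfrac{k\,\Delta x}{2} \right)} \tag{19.2.14} $$

so the method is stable for any size $\Delta t$. This scheme is called the Crank-Nicolson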
scheme, and is our recommended method for any simple diffusion problem (perhaps 
supplemented by a few fully implicit steps at the end). (See Figure 19.2.1.) 
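A Crank-Nicolson step differs from the fully implicit one only in the coefficients and in the explicit half of the average that appears on the right-hand side. The sketch below assumes fixed (Dirichlet) end values; the helper names are illustrative, and the tridiagonal routine is again a stand-in for the method of §2.4.

```python
def solve_tridiagonal(a, b, c, r):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = r[i] (no pivoting)."""
    n = len(r)
    cp, rp = [0.0] * n, [0.0] * n
    cp[0], rp[0] = c[0] / b[0], r[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        rp[i] = (r[i] - a[i] * rp[i - 1]) / denom
    x = rp[:]
    for i in range(n - 2, -1, -1):
        x[i] = rp[i] - cp[i] * x[i + 1]
    return x

def crank_nicolson_step(u, alpha):
    """One step of (19.2.13) with alpha = D*dt/dx**2; u[0], u[J] held fixed."""
    J = len(u) - 1
    n = J - 1
    h = 0.5 * alpha
    # Explicit half of the averaged second difference goes on the right-hand side.
    r = [u[j] + h * (u[j + 1] - 2.0 * u[j] + u[j - 1]) for j in range(1, J)]
    r[0] += h * u[0]                 # implicit half of the boundary terms
    r[-1] += h * u[J]
    inner = solve_tridiagonal([-h] * n, [1.0 + alpha] * n, [-h] * n, r)
    return [u[0]] + inner + [u[J]]
```

A single sine mode with zero end values is multiplied by exactly the factor (19.2.14) in one step, which makes a convenient correctness check for an implementation of this kind.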

Now turn to some generalizations of the simple diffusion equation (19.2.3). 
Suppose first that the diffusion coefficient D is not constant, say D = D(x). We can 
adopt either of two strategies. First, we can make an analytic change of variable 

$$ y = \int \frac{dx}{D(x)} \tag{19.2.15} $$
[Figure 19.2.1: stencil diagrams with axes "t or n" and "x or j"; panels (a) FTCS, (b) Fully Implicit, (c) Crank-Nicolson.]

Figure 19.2.1. Three differencing schemes for diffusive problems (shown as in Figure 19.1.2). (a)
Forward Time Center Space is first-order accurate, but stable only for sufficiently small timesteps. (b) Fully
Implicit is stable for arbitrarily large timesteps, but is still only first-order accurate. (c) Crank-Nicolson
is second-order accurate, and is usually stable for large timesteps.


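Then

$$ \frac{\partial u}{\partial t} = \frac{\partial}{\partial x}\left[ D(x)\,\frac{\partial u}{\partial x} \right] \tag{19.2.16} $$

becomes

$$ \frac{\partial u}{\partial t} = \frac{1}{D(y)}\,\frac{\partial^2 u}{\partial y^2} \tag{19.2.17} $$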


and we evaluate D at the appropriate $y_j$. Heuristically, the stability criterion (19.2.6)
in an explicit scheme becomes

$$ \Delta t \le \min_j \left[ \frac{(\Delta y)^2}{2 D_j^{-1}} \right] \tag{19.2.18} $$

Note that constant spacing $\Delta y$ in y does not imply constant spacing in x.

An alternative method that does not require analytically tractable forms for 
D is simply to difference equation (19.2.16) as it stands, centering everything 
appropriately. Thus the FTCS method becomes 


$$ \frac{u_j^{n+1} - u_j^n}{\Delta t} = \frac{D_{j+1/2}(u_{j+1}^n - u_j^n) - D_{j-1/2}(u_j^n - u_{j-1}^n)}{(\Delta x)^2} \tag{19.2.19} $$

where

$$ D_{j+1/2} \equiv D(x_{j+1/2}) \tag{19.2.20} $$

and the heuristic stability criterion is

$$ \Delta t \le \min_j \left[ \frac{(\Delta x)^2}{2 D_{j+1/2}} \right] \tag{19.2.21} $$


The Crank-Nicolson method can be generalized similarly. 

The second complication one can consider is a nonlinear diffusion problem, 
for example where D = D(u). Explicit schemes can be generalized in the obvious 
way. For example, in equation (19.2.19) write 

$$ D_{j+1/2} = \tfrac{1}{2} \left[ D(u_{j+1}^n) + D(u_j^n) \right] \tag{19.2.22} $$

Implicit schemes are not as easy. The replacement (19.2.22) with $n \to n + 1$ leaves
us with a nasty set of coupled nonlinear equations to solve at each timestep. Often 
there is an easier way: If the form of D(u) allows us to integrate 

$$ dz = D(u)\,du \tag{19.2.23} $$

analytically for $z(u)$, then the right-hand side of (19.2.1) becomes $\partial^2 z/\partial x^2$, which
we difference implicitly as

$$ \frac{z_{j+1}^{n+1} - 2z_j^{n+1} + z_{j-1}^{n+1}}{(\Delta x)^2} \tag{19.2.24} $$


Now linearize each term on the right-hand side of equation (19.2.24), for example 

$$ z_j^{n+1} \equiv z(u_j^{n+1}) = z(u_j^n) + (u_j^{n+1} - u_j^n) \left. \frac{\partial z}{\partial u} \right|_{j,n} = z(u_j^n) + (u_j^{n+1} - u_j^n)\,D(u_j^n) \tag{19.2.25} $$


This reduces the problem to tridiagonal form again and in practice usually retains 
the stability advantages of fully implicit differencing. 
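To make the linearization concrete: writing $w_j = u_j^{n+1} - u_j^n$ and substituting (19.2.25) into the implicit difference (19.2.24) gives a tridiagonal system for the increments $w_j$. The sketch below is one possible arrangement under stated assumptions: end values are held fixed (so $w_0 = w_J = 0$), the caller supplies both `D` and its integral `z` as Python functions, and all names are hypothetical.

```python
def solve_tridiagonal(a, b, c, r):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = r[i] (no pivoting)."""
    n = len(r)
    cp, rp = [0.0] * n, [0.0] * n
    cp[0], rp[0] = c[0] / b[0], r[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        rp[i] = (r[i] - a[i] * rp[i - 1]) / denom
    x = rp[:]
    for i in range(n - 2, -1, -1):
        x[i] = rp[i] - cp[i] * x[i + 1]
    return x

def nonlinear_implicit_step(u, dt, dx, D, z):
    """Linearized implicit step for du/dt = d2z/dx2 with dz = D(u) du.
    Solves for w = u_new - u_old; u[0] and u[J] are kept fixed (assumption)."""
    J = len(u) - 1
    s = dt / dx ** 2
    Dn = [D(v) for v in u]           # D evaluated at the old time level
    zn = [z(v) for v in u]           # z evaluated at the old time level
    a = [-s * Dn[j - 1] for j in range(1, J)]
    b = [1.0 + 2.0 * s * Dn[j] for j in range(1, J)]
    c = [-s * Dn[j + 1] for j in range(1, J)]
    r = [s * (zn[j + 1] - 2.0 * zn[j] + zn[j - 1]) for j in range(1, J)]
    w = solve_tridiagonal(a, b, c, r)
    return [u[0]] + [u[j] + w[j - 1] for j in range(1, J)] + [u[J]]
```

For example, with the illustrative choice D(u) = u one would pass `D=lambda v: v` and `z=lambda v: 0.5 * v * v`; with constant D the step reduces to the fully implicit scheme (19.2.8).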



Schrödinger Equation

Sometimes the physical problem being solved imposes constraints on the
differencing scheme that we have not yet taken into account. For example, consider
the time-dependent Schrödinger equation of quantum mechanics. This is basically a
parabolic equation for the evolution of a complex quantity $\psi$. For the scattering of a
wavepacket by a one-dimensional potential $V(x)$, the equation has the form

$$ i\,\frac{\partial \psi}{\partial t} = -\frac{\partial^2 \psi}{\partial x^2} + V(x)\,\psi \tag{19.2.26} $$

(Here we have chosen units so that Planck's constant $\hbar = 1$ and the particle mass
$m = 1/2$.) One is given the initial wavepacket, $\psi(x, t = 0)$, together with boundary
conditions that $\psi \to 0$ at $x \to \pm\infty$. Suppose we content ourselves with first-order
accuracy in time, but want to use an implicit scheme, for stability. A slight
generalization of (19.2.8) leads to


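$$ i \left[ \frac{\psi_j^{n+1} - \psi_j^n}{\Delta t} \right] = -\left[ \frac{\psi_{j+1}^{n+1} - 2\psi_j^{n+1} + \psi_{j-1}^{n+1}}{(\Delta x)^2} \right] + V_j\,\psi_j^{n+1} \tag{19.2.27} $$

for which

$$ \xi = \frac{1}{1 + i \left[ \dfrac{4\,\Delta t}{(\Delta x)^2} \sin^2\left( \dfrac{k\,\Delta x}{2} \right) + V_j\,\Delta t \right]} \tag{19.2.28} $$

This is unconditionally stable, but unfortunately is not unitary. The underlying
physical problem requires that the total probability of finding the particle somewhere
remains unity. This is represented formally by the modulus-square norm of $\psi$
remaining unity:

$$ \int_{-\infty}^{\infty} |\psi|^2\,dx = 1 \tag{19.2.29} $$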


The initial wave function $\psi(x, 0)$ is normalized to satisfy (19.2.29). The Schrödinger
equation (19.2.26) then guarantees that this condition is satisfied at all later times.
Let us write equation (19.2.26) in the form

$$ i\,\frac{\partial \psi}{\partial t} = H\psi \tag{19.2.30} $$

where the operator H is

$$ H = -\frac{\partial^2}{\partial x^2} + V(x) \tag{19.2.31} $$

The formal solution of equation (19.2.30) is

$$ \psi(x, t) = e^{-iHt}\,\psi(x, 0) \tag{19.2.32} $$

where the exponential of the operator is defined by its power series expansion. 

The unstable explicit FTCS scheme approximates (19.2.32) as

$$ \psi_j^{n+1} = (1 - iH\,\Delta t)\,\psi_j^n \tag{19.2.33} $$

where H is represented by a centered finite-difference approximation in x. The
stable implicit scheme (19.2.27) is, by contrast,

$$ \psi_j^{n+1} = (1 + iH\,\Delta t)^{-1}\,\psi_j^n \tag{19.2.34} $$



These are both first-order accurate in time, as can be seen by expanding equation 
(19.2.32). However, neither operator in (19.2.33) or (19.2.34) is unitary. 
The correct way to difference Schrödinger's equation [1,2] is to use Cayley's
form for the finite-difference representation of $e^{-iHt}$, which is second-order accurate
and unitary:

$$ e^{-iHt} \simeq \frac{1 - \frac{1}{2} iH\,\Delta t}{1 + \frac{1}{2} iH\,\Delta t} \tag{19.2.35} $$

In other words,

$$ \left( 1 + \tfrac{1}{2} iH\,\Delta t \right) \psi_j^{n+1} = \left( 1 - \tfrac{1}{2} iH\,\Delta t \right) \psi_j^n \tag{19.2.36} $$

On replacing H by its finite-difference approximation in x, we have a complex
tridiagonal system to solve. The method is stable, unitary, and second-order accurate 
in space and time. In fact, it is simply the Crank-Nicolson method once again! 
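The complex tridiagonal system of (19.2.36) can be sketched as follows. This is an illustration only: it assumes $\psi = 0$ at both ends of a finite grid (standing in for the conditions at infinity), uses Python's built-in complex numbers, and the names `cayley_step`, `norm2`, and `gaussian_packet` are hypothetical.

```python
import cmath

def solve_tridiagonal(a, b, c, r):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = r[i] (no pivoting);
    works unchanged for complex coefficients."""
    n = len(r)
    cp, rp = [0.0] * n, [0.0] * n
    cp[0], rp[0] = c[0] / b[0], r[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        rp[i] = (r[i] - a[i] * rp[i - 1]) / denom
    x = rp[:]
    for i in range(n - 2, -1, -1):
        x[i] = rp[i] - cp[i] * x[i + 1]
    return x

def cayley_step(psi, V, dt, dx):
    """One step of (19.2.36), with H psi_j = -(psi_{j+1} - 2 psi_j + psi_{j-1})/dx**2
    + V_j psi_j and psi[0] = psi[J] = 0 (assumed boundary values)."""
    J = len(psi) - 1
    beta = 0.5j * dt / dx ** 2
    n = J - 1
    b = [1.0 + 2.0 * beta + 0.5j * dt * V[j] for j in range(1, J)]
    # Right-hand side: (1 - i H dt / 2) applied to the old wave function.
    r = [beta * psi[j - 1] + (1.0 - 2.0 * beta - 0.5j * dt * V[j]) * psi[j]
         + beta * psi[j + 1] for j in range(1, J)]
    return [0j] + solve_tridiagonal([-beta] * n, b, [-beta] * n, r) + [0j]

def norm2(psi, dx):
    """Discrete analog of the integral in (19.2.29)."""
    return dx * sum(abs(p) ** 2 for p in psi)

def gaussian_packet(x, x0, k0):
    """Unnormalized Gaussian wavepacket centered at x0 with wavenumber k0."""
    return [cmath.exp(-(xi - x0) ** 2 + 1j * k0 * xi) for xi in x]
```

Because the discretized H is a real symmetric matrix, the Cayley form is exactly unitary, so `norm2` should stay constant to rounding error from step to step, whatever the timestep.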


CITED REFERENCES AND FURTHER READING: 

Ames, W.F. 1977, Numerical Methods for Partial Differential Equations, 2nd ed. (New York: 
Academic Press), Chapter 2. 

Goldberg, A., Schey, H.M., and Schwartz, J.L. 1967, American Journal of Physics, vol. 35, 
pp. 177-186. [1] 

Galbraith, I., Ching, Y.S., and Abraham, E. 1984, American Journal of Physics, vol. 52,
pp. 60-68. [2]


19.3 Initial Value Problems in Multidimensions 

The methods described in §19.1 and §19.2 for problems in 1 + 1 dimension 
(one space and one time dimension) can easily be generalized to N + 1 dimensions. 
However, the computing power necessary to solve the resulting equations is enormous.
If you have solved a one-dimensional problem with 100 spatial grid points,
solving the two-dimensional version with 100 × 100 mesh points requires at least
100 times as much computing. You generally have to be content with very modest 
spatial resolution in multidimensional problems. 

Indulge us in offering a bit of advice about the development and testing of 
multidimensional PDE codes: You should always first run your programs on very 
small grids, e.g., 8 × 8, even though the resulting accuracy is so poor as to be
useless. When your program is all debugged and demonstrably stable, then you can 
increase the grid size to a reasonable one and start looking at the results. We have 
actually heard someone protest, “my program would be unstable for a crude grid, 
but I am sure the instability will go away on a larger grid.” That is nonsense of a 
most pernicious sort, evidencing total confusion between accuracy and stability. In 
fact, new instabilities sometimes do show up on larger grids; but old instabilities 
never (in our experience) just go away. 

Forced to live with modest grid sizes, some people recommend going to higher- 
order methods in an attempt to improve accuracy. This is very dangerous. Unless the 
solution you are looking for is known to be smooth, and the high-order method you 
are using is known to be extremely stable, we do not recommend anything higher
than second-order in time (for sets of first-order equations). For spatial differencing, 
we recommend the order of the underlying PDEs, perhaps allowing second-order 
spatial differencing for first-order-in-space PDEs. When you increase the order of 
a differencing method to greater than the order of the original PDEs, you introduce 
spurious solutions to the difference equations. This does not create a problem if they 
all happen to decay exponentially; otherwise you are going to see all hell break loose! 

Lax Method for a Flux-Conservative Equation 


As an example, we show how to generalize the Lax method (19.1.15) to two 
dimensions for the conservation equation 


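$$ \frac{\partial u}{\partial t} = -\nabla \cdot \mathbf{F} = -\left( \frac{\partial F_x}{\partial x} + \frac{\partial F_y}{\partial y} \right) \tag{19.3.1} $$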


Use a spatial grid with 


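$$ x_j = x_0 + j\Delta, \qquad y_l = y_0 + l\Delta \tag{19.3.2} $$

We have chosen $\Delta x = \Delta y \equiv \Delta$ for simplicity. Then the Lax scheme is

$$ u_{j,l}^{n+1} = \tfrac{1}{4} \left( u_{j+1,l}^n + u_{j-1,l}^n + u_{j,l+1}^n + u_{j,l-1}^n \right) - \frac{\Delta t}{2\Delta} \left( F_{j+1,l}^n - F_{j-1,l}^n + F_{j,l+1}^n - F_{j,l-1}^n \right) \tag{19.3.3} $$

Note that as an abbreviated notation $F_{j+1}$ and $F_{j-1}$ refer to $F_x$, while $F_{l+1}$ and
$F_{l-1}$ refer to $F_y$.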

Let us carry out a stability analysis for the model advective equation (analog 
of 19.1.6) with 


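$$ F_x = v_x u, \qquad F_y = v_y u \tag{19.3.4} $$

This requires an eigenmode with two dimensions in space, though still only a simple
dependence on powers of $\xi$ in time,

$$ u_{j,l}^n = \xi^n e^{i k_x j \Delta} e^{i k_y l \Delta} \tag{19.3.5} $$

Substituting in equation (19.3.3), we find

$$ \xi = \tfrac{1}{2} \left( \cos k_x\Delta + \cos k_y\Delta \right) - i\alpha_x \sin k_x\Delta - i\alpha_y \sin k_y\Delta \tag{19.3.6} $$

where

$$ \alpha_x = \frac{v_x\,\Delta t}{\Delta}, \qquad \alpha_y = \frac{v_y\,\Delta t}{\Delta} \tag{19.3.7} $$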
The expression for $|\xi|^2$ can be manipulated into the form

$$ |\xi|^2 = 1 - \left( \sin^2 k_x\Delta + \sin^2 k_y\Delta \right) \left[ \tfrac{1}{2} - (\alpha_x^2 + \alpha_y^2) \right] - \tfrac{1}{4} \left( \cos k_x\Delta - \cos k_y\Delta \right)^2 - \left( \alpha_y \sin k_x\Delta - \alpha_x \sin k_y\Delta \right)^2 \tag{19.3.8} $$

The last two terms are negative, and so the stability requirement $|\xi|^2 \le 1$ becomes

$$ \tfrac{1}{2} - (\alpha_x^2 + \alpha_y^2) \ge 0 \tag{19.3.9} $$

or

$$ \Delta t \le \frac{\Delta}{\sqrt{2}\,(v_x^2 + v_y^2)^{1/2}} \tag{19.3.10} $$

This is an example of the general result for the N-dimensional Courant
condition: If $|v|$ is the maximum propagation velocity in the problem, then

$$ \Delta t \le \frac{\Delta}{\sqrt{N}\,|v|} \tag{19.3.11} $$


is the Courant condition. 
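The two-dimensional condition (19.3.9) can be verified by brute force from the amplification factor (19.3.6). The sketch below is only a numerical check; the grid of sampled wavenumbers and the function names are assumptions made for the example.

```python
import math

def lax2d_xi(ax, ay, kx, ky):
    """Amplification factor (19.3.6); kx, ky stand for k_x*Delta and k_y*Delta."""
    return complex(0.5 * (math.cos(kx) + math.cos(ky)),
                   -ax * math.sin(kx) - ay * math.sin(ky))

def max_abs_xi(ax, ay, n=64):
    """Largest |xi| over an n-by-n sample of wavenumbers in [0, 2*pi)."""
    ks = [2.0 * math.pi * i / n for i in range(n)]
    return max(abs(lax2d_xi(ax, ay, kx, ky)) for kx in ks for ky in ks)

marginal = max_abs_xi(0.5, 0.5)   # alpha_x**2 + alpha_y**2 = 0.5, the limit of (19.3.9)
unstable = max_abs_xi(0.6, 0.6)   # alpha_x**2 + alpha_y**2 = 0.72, violates (19.3.9)
```

In the unstable case the growth is largest for the mode with $k_x\Delta = k_y\Delta = \pi/2$, in agreement with the first bracketed term of (19.3.8).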


Diffusion Equation in Multidimensions 


Let us consider the two-dimensional diffusion equation, 

$$ \frac{\partial u}{\partial t} = D \left( \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} \right) \tag{19.3.12} $$


An explicit method, such as FTCS, can be generalized from the one-dimensional 
case in the obvious way. However, we have seen that diffusive problems are usually 
best treated implicitly. Suppose we try to implement the Crank-Nicolson scheme in 
two dimensions. This would give us 


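$$ u_{j,l}^{n+1} = u_{j,l}^n + \tfrac{1}{2}\alpha \left( \delta_x^2 u_{j,l}^{n+1} + \delta_x^2 u_{j,l}^n + \delta_y^2 u_{j,l}^{n+1} + \delta_y^2 u_{j,l}^n \right) \tag{19.3.13} $$

Here

$$ \alpha \equiv \frac{D\,\Delta t}{\Delta^2}, \qquad \Delta \equiv \Delta x = \Delta y \tag{19.3.14} $$

$$ \delta_x^2 u_{j,l}^n \equiv u_{j+1,l}^n - 2u_{j,l}^n + u_{j-1,l}^n \tag{19.3.15} $$

and similarly for $\delta_y^2 u_{j,l}^n$. This is certainly a viable scheme; the problem arises in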
solving the coupled linear equations. Whereas in one space dimension the system 
was tridiagonal, that is no longer true, though the matrix is still very sparse. One 
possibility is to use a suitable sparse matrix technique (see §2.7 and §19.0). 

Another possibility, which we generally prefer, is a slightly different way of 
generalizing the Crank-Nicolson algorithm. It is still second-order accurate in time 
and space, and unconditionally stable, but the equations are easier to solve than 
(19.3.13). Called the alternating-direction implicit method (ADI), this embodies the
powerful concept of operator splitting or time splitting, about which we will say
more below. Here, the idea is to divide each timestep into two steps of size $\Delta t/2$.
In each substep, a different dimension is treated implicitly:

$$ u_{j,l}^{n+1/2} = u_{j,l}^n + \tfrac{1}{2}\alpha \left( \delta_x^2 u_{j,l}^{n+1/2} + \delta_y^2 u_{j,l}^n \right) $$

$$ u_{j,l}^{n+1} = u_{j,l}^{n+1/2} + \tfrac{1}{2}\alpha \left( \delta_x^2 u_{j,l}^{n+1/2} + \delta_y^2 u_{j,l}^{n+1} \right) \tag{19.3.16} $$


The advantage of this method is that each substep requires only the solution of a 
simple tridiagonal system. 
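The two substeps of (19.3.16) can be sketched as two sets of one-dimensional tridiagonal solves. The version below is a minimal illustration that assumes a square grid stored as a list of rows `u[j][l]` with u = 0 on the whole boundary; the function names are hypothetical.

```python
def solve_tridiagonal(a, b, c, r):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = r[i] (no pivoting)."""
    n = len(r)
    cp, rp = [0.0] * n, [0.0] * n
    cp[0], rp[0] = c[0] / b[0], r[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        rp[i] = (r[i] - a[i] * rp[i - 1]) / denom
    x = rp[:]
    for i in range(n - 2, -1, -1):
        x[i] = rp[i] - cp[i] * x[i + 1]
    return x

def adi_step(u, alpha):
    """One ADI step (19.3.16) with alpha = D*dt/Delta**2; u is zero on the
    boundary of an (N+1)-by-(N+1) grid (assumed homogeneous Dirichlet)."""
    N = len(u) - 1
    n = N - 1
    h = 0.5 * alpha
    a, b, c = [-h] * n, [1.0 + alpha] * n, [-h] * n
    half = [[0.0] * (N + 1) for _ in range(N + 1)]
    for l in range(1, N):            # first substep: implicit in x, explicit in y
        r = [u[j][l] + h * (u[j][l + 1] - 2.0 * u[j][l] + u[j][l - 1])
             for j in range(1, N)]
        column = solve_tridiagonal(a, b, c, r)
        for j in range(1, N):
            half[j][l] = column[j - 1]
    new = [[0.0] * (N + 1) for _ in range(N + 1)]
    for j in range(1, N):            # second substep: implicit in y, explicit in x
        r = [half[j][l] + h * (half[j + 1][l] - 2.0 * half[j][l] + half[j - 1][l])
             for l in range(1, N)]
        new[j][1:N] = solve_tridiagonal(a, b, c, r)
    return new
```

Each full step costs 2(N - 1) tridiagonal solves of length N - 1, compared with one large sparse solve for the unsplit scheme (19.3.13).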


Operator Splitting Methods Generally 


The basic idea of operator splitting, which is also called time splitting or the 
method of fractional steps, is this: Suppose you have an initial value equation of 
the form 


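$$ \frac{\partial u}{\partial t} = \mathcal{L} u \tag{19.3.17} $$

where $\mathcal{L}$ is some operator. While $\mathcal{L}$ is not necessarily linear, suppose that it can at
least be written as a linear sum of m pieces, which act additively on u,

$$ \mathcal{L} u = \mathcal{L}_1 u + \mathcal{L}_2 u + \cdots + \mathcal{L}_m u \tag{19.3.18} $$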

Finally, suppose that for each of the pieces, you already know a differencing scheme 
for updating the variable u from timestep n to timestep n + 1, valid if that piece 
of the operator were the only one on the right-hand side. We will write these 
updatings symbolically as 


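$$ \begin{aligned} u^{n+1} &= \mathcal{U}_1(u^n, \Delta t) \\ u^{n+1} &= \mathcal{U}_2(u^n, \Delta t) \\ &\;\;\vdots \\ u^{n+1} &= \mathcal{U}_m(u^n, \Delta t) \end{aligned} \tag{19.3.19} $$

Now, one form of operator splitting would be to get from n to n + 1 by the
following sequence of updatings:

$$ \begin{aligned} u^{n+(1/m)} &= \mathcal{U}_1(u^n, \Delta t) \\ u^{n+(2/m)} &= \mathcal{U}_2(u^{n+(1/m)}, \Delta t) \\ &\;\;\vdots \\ u^{n+1} &= \mathcal{U}_m(u^{n+(m-1)/m}, \Delta t) \end{aligned} \tag{19.3.20} $$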
For example, a combined advective-diffusion equation, such as

$$ \frac{\partial u}{\partial t} = -v\,\frac{\partial u}{\partial x} + D\,\frac{\partial^2 u}{\partial x^2} \tag{19.3.21} $$


might profitably use an explicit scheme for the advective term combined with a 
Crank-Nicolson or other implicit scheme for the diffusion term. 
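A minimal sketch of the sequence (19.3.20) for equation (19.3.21) with m = 2 follows. The particular pieces are assumptions chosen for brevity, not a recommendation from the text: first-order upwind differencing for the advective piece (valid for v > 0 and $v\,\Delta t/\Delta x \le 1$) and a fully implicit update for the diffusive piece, with fixed end values.

```python
def solve_tridiagonal(a, b, c, r):
    """Solve a[i]*x[i-1] + b[i]*x[i] + c[i]*x[i+1] = r[i] (no pivoting)."""
    n = len(r)
    cp, rp = [0.0] * n, [0.0] * n
    cp[0], rp[0] = c[0] / b[0], r[0] / b[0]
    for i in range(1, n):
        denom = b[i] - a[i] * cp[i - 1]
        cp[i] = c[i] / denom
        rp[i] = (r[i] - a[i] * rp[i - 1]) / denom
    x = rp[:]
    for i in range(n - 2, -1, -1):
        x[i] = rp[i] - cp[i] * x[i + 1]
    return x

def advect_upwind(u, dt, v, dx):
    """Explicit upwind update for du/dt = -v du/dx (v > 0 assumed); u[0] is inflow."""
    c = v * dt / dx
    return [u[0]] + [u[j] - c * (u[j] - u[j - 1]) for j in range(1, len(u))]

def diffuse_implicit(u, dt, D, dx):
    """Fully implicit update for du/dt = D d2u/dx2 with fixed end values."""
    alpha = D * dt / dx ** 2
    n = len(u) - 2
    r = list(u[1:-1])
    r[0] += alpha * u[0]
    r[-1] += alpha * u[-1]
    inner = solve_tridiagonal([-alpha] * n, [1.0 + 2.0 * alpha] * n, [-alpha] * n, r)
    return [u[0]] + inner + [u[-1]]

def split_step(u, dt, v, D, dx):
    """Sequence (19.3.20) with m = 2: advective piece, then diffusive piece,
    each applied with the full timestep dt."""
    return diffuse_implicit(advect_upwind(u, dt, v, dx), dt, D, dx)
```

Each piece uses the full $\Delta t$ here, because each updater contains only its own part of the operator.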

The alternating-direction implicit (ADI) method, equation (19.3.16), is an 
example of operator splitting with a slightly different twist. Let us reinterpret 
(19.3.19) to have a different meaning: Let $\mathcal{U}_1$ now denote an updating method that
includes algebraically all the pieces of the total operator $\mathcal{L}$, but which is desirably
stable only for the $\mathcal{L}_1$ piece; likewise $\mathcal{U}_2, \ldots, \mathcal{U}_m$. Then a method of getting from
$u^n$ to $u^{n+1}$ is

$$ \begin{aligned} u^{n+1/m} &= \mathcal{U}_1(u^n, \Delta t/m) \\ u^{n+2/m} &= \mathcal{U}_2(u^{n+1/m}, \Delta t/m) \\ &\;\;\vdots \\ u^{n+1} &= \mathcal{U}_m(u^{n+(m-1)/m}, \Delta t/m) \end{aligned} \tag{19.3.22} $$

The timestep for each fractional step in (19.3.22) is now only 1/m of the full timestep,
because each partial operation acts with all the terms of the original operator.

Equation (19.3.22) is usually, though not always, stable as a differencing scheme
for the operator $\mathcal{L}$. In fact, as a rule of thumb, it is often sufficient to have stable $\mathcal{U}_i$'s
only for the operator pieces having the highest number of spatial derivatives (the
other $\mathcal{U}_i$'s can be unstable) to make the overall scheme stable!

It is at this point that we turn our attention from initial value problems to 
boundary value problems. These will occupy us for the remainder of the chapter. 

CITED REFERENCES AND FURTHER READING: 

Ames, W.F. 1977, Numerical Methods for Partial Differential Equations, 2nd ed. (New York: 
Academic Press). 


19.4 Fourier and Cyclic Reduction Methods for 
Boundary Value Problems 

As discussed in §19.0, most boundary value problems (elliptic equations, for 
example) reduce to solving large sparse linear systems of the form 



$$ \mathbf{A} \cdot \mathbf{u} = \mathbf{b} \tag{19.4.1} $$

either once, for boundary value equations that are linear, or iteratively, for boundary
value equations that are nonlinear.

Two important techniques lead to “rapid” solution of equation (19.4.1) when
the sparse matrix is of certain frequently occurring forms. The Fourier transform 
method is directly applicable when the equations have coefficients that are constant 
in space. The cyclic reduction method is somewhat more general; its applicability 
is related to the question of whether the equations are separable (in the sense of 
“separation of variables”). Both methods require the boundaries to coincide with 
the coordinate lines. Finally, for some problems, there is a powerful combination 
of these two methods called FACR (Fourier Analysis and Cyclic Reduction). We 
now consider each method in turn, using equation (19.0.3), with finite-difference 
representation (19.0.6), as a model example. Generally speaking, the methods in this 
section are faster, when they apply, than the simpler relaxation methods discussed 
in §19.5; but they are not necessarily faster than the more complicated multigrid 
methods discussed in §19.6. 


Fourier Transform Method 


The discrete inverse Fourier transform in both x and y is 

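$$ u_{jl} = \frac{1}{JL} \sum_{m=0}^{J-1} \sum_{n=0}^{L-1} \hat{u}_{mn}\, e^{-2\pi i jm/J}\, e^{-2\pi i ln/L} \tag{19.4.2} $$

This can be computed using the FFT independently in each dimension, or else all at
once via the routine fourn of §12.4 or the routine rlft3 of §12.5. Similarly,

$$ \rho_{jl} = \frac{1}{JL} \sum_{m=0}^{J-1} \sum_{n=0}^{L-1} \hat{\rho}_{mn}\, e^{-2\pi i jm/J}\, e^{-2\pi i ln/L} \tag{19.4.3} $$

If we substitute expressions (19.4.2) and (19.4.3) in our model problem (19.0.6),
we find

$$ \hat{u}_{mn} \left( e^{2\pi i m/J} + e^{-2\pi i m/J} + e^{2\pi i n/L} + e^{-2\pi i n/L} - 4 \right) = \hat{\rho}_{mn} \Delta^2 \tag{19.4.4} $$

or

$$ \hat{u}_{mn} = \frac{\hat{\rho}_{mn} \Delta^2}{2 \left( \cos \dfrac{2\pi m}{J} + \cos \dfrac{2\pi n}{L} - 2 \right)} \tag{19.4.5} $$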


Thus the strategy for solving equation (19.0.6) by FFT techniques is: 

• Compute $\hat{\rho}_{mn}$ as the Fourier transform

$$ \hat{\rho}_{mn} = \sum_{j=0}^{J-1} \sum_{l=0}^{L-1} \rho_{jl}\, e^{2\pi i mj/J}\, e^{2\pi i nl/L} \tag{19.4.6} $$

• Compute $\hat{u}_{mn}$ from equation (19.4.5).

• Compute $u_{jl}$ by the inverse Fourier transform (19.4.2).
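The three steps above can be sketched as follows for a small periodic grid. To stay self-contained the example uses a direct (slow) discrete Fourier sum instead of an FFT, so it is suitable only for tiny grids. It also makes one assumption the text does not spell out: the m = n = 0 mode, for which the denominator of (19.4.5) vanishes, is simply set to zero, which fixes the arbitrary mean of u and requires the source to have zero mean.

```python
import cmath
import math

def dft2(f, sign):
    """Unnormalized 2-D transform: out[m][n] = sum over j, l of
    f[j][l] * exp(sign * 2*pi*i * (j*m/J + l*n/L)). Direct sums, not an FFT."""
    J, L = len(f), len(f[0])
    wj = [cmath.exp(sign * 2j * math.pi * k / J) for k in range(J)]
    wl = [cmath.exp(sign * 2j * math.pi * k / L) for k in range(L)]
    rows = [[sum(f[j][l] * wl[(l * n) % L] for l in range(L)) for n in range(L)]
            for j in range(J)]
    return [[sum(rows[j][n] * wj[(j * m) % J] for j in range(J)) for n in range(L)]
            for m in range(J)]

def poisson_periodic(rho, delta):
    """Solve the five-point model problem with periodic boundaries."""
    J, L = len(rho), len(rho[0])
    rho_hat = dft2(rho, +1)                              # step 1, (19.4.6)
    u_hat = [[0j] * L for _ in range(J)]
    for m in range(J):
        for n in range(L):
            if m == 0 and n == 0:
                continue                                 # mean of u left at zero (assumption)
            denom = 2.0 * (math.cos(2.0 * math.pi * m / J)
                           + math.cos(2.0 * math.pi * n / L) - 2.0)
            u_hat[m][n] = rho_hat[m][n] * delta ** 2 / denom   # step 2, (19.4.5)
    u = dft2(u_hat, -1)                                  # step 3, (19.4.2)
    return [[(u[j][l] / (J * L)).real for l in range(L)] for j in range(J)]
```

In production one would replace `dft2` by an FFT; the structure of the three steps is unchanged.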

The above procedure is valid for periodic boundary conditions. In other words, 
the solution satisfies 


$$ u_{jl} = u_{j+J,l} = u_{j,l+L} \tag{19.4.7} $$

Next consider a Dirichlet boundary condition u = 0 on the rectangular boundary.
Instead of the expansion (19.4.2), we now need an expansion in sine waves:

$$ u_{jl} = \frac{2}{J}\,\frac{2}{L} \sum_{m=1}^{J-1} \sum_{n=1}^{L-1} \hat{u}_{mn} \sin \frac{\pi jm}{J} \sin \frac{\pi ln}{L} \tag{19.4.8} $$

This satisfies the boundary conditions that u = 0 at j = 0, J and at l = 0, L. If we
substitute this expansion and the analogous one for $\rho_{jl}$ into equation (19.0.6), we
find that the solution procedure parallels that for periodic boundary conditions: 

• Compute $\hat{\rho}_{mn}$ by the sine transform

$$ \hat{\rho}_{mn} = \sum_{j=1}^{J-1} \sum_{l=1}^{L-1} \rho_{jl} \sin \frac{\pi jm}{J} \sin \frac{\pi ln}{L} \tag{19.4.9} $$

(A fast sine transform algorithm was given in §12.3.)

• Compute $\hat{u}_{mn}$ from the expression analogous to (19.4.5),

$$ \hat{u}_{mn} = \frac{\Delta^2 \hat{\rho}_{mn}}{2 \left( \cos \dfrac{\pi m}{J} + \cos \dfrac{\pi n}{L} - 2 \right)} \tag{19.4.10} $$

• Compute $u_{jl}$ by the inverse sine transform (19.4.8).
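The Dirichlet procedure can be sketched in the same spirit. The example below again uses direct double sums in place of a fast sine transform, so it is meant only for small grids; arrays are indexed 0..J by 0..L with the boundary entries left at zero, and the function names are hypothetical.

```python
import math

def sine2(f, J, L):
    """out[m][n] = sum over interior j, l of f[j][l]*sin(pi*j*m/J)*sin(pi*l*n/L).
    Entries with m in {0, J} or n in {0, L} are left at zero."""
    out = [[0.0] * (L + 1) for _ in range(J + 1)]
    for m in range(1, J):
        for n in range(1, L):
            out[m][n] = sum(f[j][l] * math.sin(math.pi * j * m / J)
                            * math.sin(math.pi * l * n / L)
                            for j in range(1, J) for l in range(1, L))
    return out

def poisson_dirichlet(rho, delta):
    """Solve the five-point model problem with u = 0 on the boundary."""
    J, L = len(rho) - 1, len(rho[0]) - 1
    rho_hat = sine2(rho, J, L)                               # (19.4.9)
    u_hat = [[0.0] * (L + 1) for _ in range(J + 1)]
    for m in range(1, J):
        for n in range(1, L):
            denom = 2.0 * (math.cos(math.pi * m / J)
                           + math.cos(math.pi * n / L) - 2.0)
            u_hat[m][n] = delta ** 2 * rho_hat[m][n] / denom  # (19.4.10)
    u = sine2(u_hat, J, L)                                   # sums of (19.4.8)
    scale = (2.0 / J) * (2.0 / L)                            # prefactor of (19.4.8)
    return [[scale * u[j][l] for l in range(L + 1)] for j in range(J + 1)]
```

Unlike the periodic case, no mode has a vanishing denominator here, since m and n never reach zero.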

If we have inhomogeneous boundary conditions, for example u = 0 on all
boundaries except u = f(y) on the boundary $x = J\Delta$, we have to add to the above
solution a solution $u^H$ of the homogeneous equation

$$ \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} = 0 \tag{19.4.11} $$

that satisfies the required boundary conditions. In the continuum case, this would
be an expression of the form

$$ u^H = \sum_n A_n \sinh \frac{n\pi x}{L\Delta} \sin \frac{n\pi y}{L\Delta} \tag{19.4.12} $$

where $A_n$ would be found by requiring that u = f(y) at $x = J\Delta$. In the discrete
case, we have

$$ u^H_{jl} = \frac{2}{L} \sum_{n=1}^{L-1} A_n \sinh \frac{\pi nj}{L} \sin \frac{\pi nl}{L} \tag{19.4.13} $$
If $f(y = l\Delta) \equiv f_l$, then we get $A_n$ from the inverse formula

$$ A_n = \frac{1}{\sinh(\pi nJ/L)} \sum_{l=1}^{L-1} f_l \sin \frac{\pi nl}{L} \tag{19.4.14} $$

The complete solution to the problem is

$$ u = u_{jl} + u^H_{jl} \tag{19.4.15} $$

By adding appropriate terms of the form (19.4.12), we can handle inhomogeneous 
terms on any boundary surface. 

A much simpler procedure for handling inhomogeneous terms is to note that 
whenever boundary terms appear on the left-hand side of (19.0.6), they can be taken 
over to the right-hand side since they are known. The effective source term is 
therefore $\rho_{jl}$ plus a contribution from the boundary terms. To implement this idea
formally, write the solution as

$$ u = u' + u^B \tag{19.4.16} $$

where u' = 0 on the boundary, while $u^B$ vanishes everywhere except on the
boundary. There it takes on the given boundary value. In the above example, the
only nonzero values of $u^B$ would be

$$ u^B_{J,l} = f_l \tag{19.4.17} $$

The model equation (19.0.3) becomes

$$ \nabla^2 u' = -\nabla^2 u^B + \rho \tag{19.4.18} $$

or, in finite-difference form,

$$ u'_{j+1,l} + u'_{j-1,l} + u'_{j,l+1} + u'_{j,l-1} - 4u'_{j,l} = -\left( u^B_{j+1,l} + u^B_{j-1,l} + u^B_{j,l+1} + u^B_{j,l-1} - 4u^B_{j,l} \right) + \Delta^2 \rho_{j,l} \tag{19.4.19} $$

All the $u^B$ terms in equation (19.4.19) vanish except when the equation is evaluated
at j = J - 1, where

$$ u'_{J,l} + u'_{J-2,l} + u'_{J-1,l+1} + u'_{J-1,l-1} - 4u'_{J-1,l} = -f_l + \Delta^2 \rho_{J-1,l} \tag{19.4.20} $$

Thus the problem is now equivalent to the case of zero boundary conditions, except
that one row of the source term is modified by the replacement

$$ \Delta^2 \rho_{J-1,l} \to \Delta^2 \rho_{J-1,l} - f_l \tag{19.4.21} $$

The case of Neumann boundary conditions $\nabla u = 0$ is handled by the cosine
expansion (12.3.17):

$$ u_{jl} = \frac{2}{J}\,\frac{2}{L} \sum_{m=0}^{J}{}'' \sum_{n=0}^{L}{}'' \hat{u}_{mn} \cos \frac{\pi jm}{J} \cos \frac{\pi ln}{L} \tag{19.4.22} $$
Here the double prime notation means that the terms for m = 0 and m = J should
be multiplied by $\frac{1}{2}$, and similarly for n = 0 and n = L. Inhomogeneous terms
$\nabla u = g$ can be again included by adding a suitable solution of the homogeneous
equation, or more simply by taking boundary terms over to the right-hand side. For
example, the condition

$$ \frac{\partial u}{\partial x} = g(y) \quad \text{at } x = 0 \tag{19.4.23} $$

becomes

$$ \frac{u_{1,l} - u_{-1,l}}{2\Delta} = g_l \tag{19.4.24} $$

where $g_l \equiv g(y = l\Delta)$. Once again we write the solution in the form (19.4.16),
where now $\nabla u' = 0$ on the boundary. This time $\nabla u^B$ takes on the prescribed
value on the boundary, but $u^B$ vanishes everywhere except just outside the boundary.
Thus equation (19.4.24) gives

$$ u^B_{-1,l} = -2\Delta g_l \tag{19.4.25} $$

All the $u^B$ terms in equation (19.4.19) vanish except when j = 0:

$$ u'_{1,l} + u'_{-1,l} + u'_{0,l+1} + u'_{0,l-1} - 4u'_{0,l} = 2\Delta g_l + \Delta^2 \rho_{0,l} \tag{19.4.26} $$

Thus u' is the solution of a zero-gradient problem, with the source term modified
by the replacement

$$ \Delta^2 \rho_{0,l} \to \Delta^2 \rho_{0,l} + 2\Delta g_l \tag{19.4.27} $$

Sometimes Neumann boundary conditions are handled by using a staggered 
grid, with the u's defined midway between zone boundaries so that first derivatives
are centered on the mesh points. You can solve such problems using similar 
techniques to those described above if you use the alternative form of the cosine 
transform, equation (12.3.23). 

Cyclic Reduction 

Evidently the FFT method works only when the original PDE has constant 
coefficients, and boundaries that coincide with the coordinate lines. An alternative 
algorithm, which can be used on somewhat more general equations, is called cyclic 
reduction (CR). 

We illustrate cyclic reduction on the equation 

$$ \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} + b(y)\,\frac{\partial u}{\partial y} + c(y)\,u = g(x, y) \tag{19.4.28} $$

This form arises very often in practice from the Helmholtz or Poisson equations in
polar, cylindrical, or spherical coordinate systems. More general separable equations
are treated in [1].

The finite-difference form of equation (19.4.28) can be written as a set of
vector equations

$$ \mathbf{u}_{j-1} + \mathbf{T} \cdot \mathbf{u}_j + \mathbf{u}_{j+1} = \mathbf{g}_j \Delta^2 \tag{19.4.29} $$

Here the index j comes from differencing in the x-direction, while the y-differencing
(denoted by the index l previously) has been left in vector form. The matrix T
has the form

$$ \mathbf{T} = \mathbf{B} - 2 \cdot \mathbf{1} \tag{19.4.30} $$

where the $2 \cdot \mathbf{1}$ comes from the x-differencing and the matrix B from the y-differencing.
The matrix B, and hence T, is tridiagonal with variable coefficients.

The CR method is derived by writing down three successive equations like
(19.4.29):

$$ \begin{aligned} \mathbf{u}_{j-2} + \mathbf{T} \cdot \mathbf{u}_{j-1} + \mathbf{u}_j &= \mathbf{g}_{j-1} \Delta^2 \\ \mathbf{u}_{j-1} + \mathbf{T} \cdot \mathbf{u}_j + \mathbf{u}_{j+1} &= \mathbf{g}_j \Delta^2 \\ \mathbf{u}_j + \mathbf{T} \cdot \mathbf{u}_{j+1} + \mathbf{u}_{j+2} &= \mathbf{g}_{j+1} \Delta^2 \end{aligned} \tag{19.4.31} $$

Matrix-multiplying the middle equation by $-\mathbf{T}$ and then adding the three equations,
we get

$$ \mathbf{u}_{j-2} + \mathbf{T}^{(1)} \cdot \mathbf{u}_j + \mathbf{u}_{j+2} = \mathbf{g}_j^{(1)} \Delta^2 \tag{19.4.32} $$

This is an equation of the same form as (19.4.29), with

$$ \mathbf{T}^{(1)} = 2 \cdot \mathbf{1} - \mathbf{T}^2, \qquad \mathbf{g}_j^{(1)} = \mathbf{g}_{j-1} - \mathbf{T} \cdot \mathbf{g}_j + \mathbf{g}_{j+1} \tag{19.4.33} $$

After one level of CR, we have reduced the number of equations by a factor of
two. Since the resulting equations are of the same form as the original equation, we
can repeat the process. Taking the number of mesh points to be a power of 2 for
simplicity, we finally end up with a single equation for the central line of variables:

$$ \mathbf{T}^{(f)} \cdot \mathbf{u}_{J/2} = \Delta^2 \mathbf{g}_{J/2}^{(f)} - \mathbf{u}_0 - \mathbf{u}_J \tag{19.4.34} $$

Here we have moved $\mathbf{u}_0$ and $\mathbf{u}_J$ to the right-hand side because they are known
boundary values. Equation (19.4.34) can be solved for $\mathbf{u}_{J/2}$ by the standard
tridiagonal algorithm. The two equations at level f - 1 involve $\mathbf{u}_{J/4}$ and $\mathbf{u}_{3J/4}$. The
equation for $\mathbf{u}_{J/4}$ involves $\mathbf{u}_0$ and $\mathbf{u}_{J/2}$, both of which are known, and hence can be
solved by the usual tridiagonal routine. A similar result holds true at every stage,
so we end up solving J - 1 tridiagonal systems.

In practice, equations (19.4.33) should be rewritten to avoid numerical instability.
For these and other practical details, refer to [2].
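The reduction and back-substitution pattern of cyclic reduction is easiest to see in a scalar analog, where the matrix T is replaced by a single number t and each right-hand side $\mathbf{g}_j \Delta^2$ by a number `r[j]`. The sketch below is only that analog, written under those assumptions; it is not the stabilized vector algorithm of [2], and the function name is hypothetical.

```python
def cyclic_reduction(t, r, u0, uJ):
    """Solve u[j-1] + t*u[j] + u[j+1] = r[j] for j = 1 .. J-1, where
    J = len(r) - 1 is a power of 2 and u[0] = u0, u[J] = uJ are known.
    r[0] and r[J] are ignored. Scalar analog of (19.4.31)-(19.4.34)."""
    J = len(r) - 1
    u = [0.0] * (J + 1)
    u[0], u[J] = u0, uJ
    levels = [(t, list(r))]              # (coefficient, right-hand sides) per level
    stride = 1
    while 2 * stride < J:                # reduction: halve the number of equations
        t_prev, r_prev = levels[-1]
        r_next = list(r_prev)
        for j in range(2 * stride, J, 2 * stride):
            r_next[j] = r_prev[j - stride] - t_prev * r_prev[j] + r_prev[j + stride]
        levels.append((2.0 - t_prev * t_prev, r_next))   # scalar form of (19.4.33)
        stride *= 2
    while stride >= 1:                   # back-substitution, coarsest level first
        t_lev, r_lev = levels.pop()
        for j in range(stride, J, 2 * stride):
            u[j] = (r_lev[j] - u[j - stride] - u[j + stride]) / t_lev
        stride //= 2
    return u
```

The first pass of the second loop solves the single central equation, the scalar form of (19.4.34); later passes fill in the points midway between already known values, for J - 1 solves in all.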
FACR Method

The best way to solve equations of the form (19.4.28), including the constant
coefficient problem (19.0.3), is a combination of Fourier analysis and cyclic reduction,
the FACR method [3-6]. If at the rth stage of CR we Fourier analyze the equations of
the form (19.4.32) along y, that is, with respect to the suppressed vector index, we
will have a tridiagonal system in the x-direction for each y-Fourier mode:

$$ \hat{u}^k_{j-2^r} + \lambda_k^{(r)} \hat{u}^k_j + \hat{u}^k_{j+2^r} = \Delta^2 g_j^{(r)k} \tag{19.4.35} $$

Here $\lambda_k^{(r)}$ is the eigenvalue of $\mathbf{T}^{(r)}$ corresponding to the kth Fourier mode. For
the equation (19.0.3), equation (19.4.5) shows that $\lambda_k^{(r)}$ will involve terms like
$\cos(2\pi k/L) - 2$ raised to a power. Solve the tridiagonal systems for $\hat{u}^k_j$ at the levels
$j = 2^r, 2 \times 2^r, 4 \times 2^r, \ldots, J - 2^r$. Fourier synthesize to get the y-values on these
x-lines. Then fill in the intermediate x-lines as in the original CR algorithm.

The trick is to choose the number of levels of CR so as to minimize the total
number of arithmetic operations. One can show that for a typical case of a 128 × 128
mesh, the optimal level is r = 2; asymptotically, $r \to \log_2(\log_2 J)$.

A rough estimate of running times for these algorithms for equation (19.0.3) 
is as follows: The FFT method (in both x and y) and the CR method are roughly 
comparable. FACR with r = 0 (that is, FFT in one dimension and solve the 
tridiagonal equations by the usual algorithm in the other dimension) gives about a 
factor of two gain in speed. The optimal FACR with r = 2 gives another factor 
of two gain in speed. 
19.5 Relaxation Methods for Boundary Value
Problems 


As we mentioned in §19.0, relaxation methods involve splitting the sparse 
matrix that arises from finite differencing and then iterating until a solution is found. 

There is another way of thinking about relaxation methods that is somewhat 
more physical. Suppose we wish to solve the elliptic equation 



$$ \mathcal{L} u = \rho \tag{19.5.1} $$

where $\mathcal{L}$ represents some elliptic operator and $\rho$ is the source term. Rewrite the
equation as a diffusion equation,

$$ \frac{\partial u}{\partial t} = \mathcal{L} u - \rho \tag{19.5.2} $$

An initial distribution u relaxes to an equilibrium solution as $t \to \infty$. This
equilibrium has all time derivatives vanishing. Therefore it is the solution of the
original elliptic problem (19.5.1). We see that all the machinery of §19.2, on diffusive
initial value equations, can be brought to bear on the solution of boundary value 
problems by relaxation methods. 

Let us apply this idea to our model problem (19.0.3). The diffusion equation is 


$$ \frac{\partial u}{\partial t} = \frac{\partial^2 u}{\partial x^2} + \frac{\partial^2 u}{\partial y^2} - \rho \tag{19.5.3} $$


If we use FTCS differencing (cf. equation 19.2.4), we get 


$$ u^{n+1}_{j,l} = u^n_{j,l} + \frac{\Delta t}{\Delta^2}\left(u^n_{j+1,l} + u^n_{j-1,l} + u^n_{j,l+1} + u^n_{j,l-1} - 4u^n_{j,l}\right) - \rho_{j,l}\,\Delta t \tag{19.5.4} $$


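Recall from (19.2.6) that FTCS differencing is stable in one spatial dimension only if
$\Delta t/\Delta^2 \le \frac{1}{2}$. In two dimensions this becomes $\Delta t/\Delta^2 \le \frac{1}{4}$. Suppose we try to take
the largest possible timestep, and set $\Delta t = \Delta^2/4$. Then equation (19.5.4) becomes

$$ u^{n+1}_{j,l} = \frac{1}{4}\left(u^n_{j+1,l} + u^n_{j-1,l} + u^n_{j,l+1} + u^n_{j,l-1}\right) - \frac{\Delta^2}{4}\rho_{j,l} \tag{19.5.5} $$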


Thus the algorithm consists of using the average of u at its four nearest-neighbor 
points on the grid (plus the contribution from the source). This procedure is then 
iterated until convergence. 
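
To make the averaging rule (19.5.5) concrete, here is a minimal sketch of a single Jacobi sweep. It is not one of the book's routines: the name jacobi_sweep, the flat row-major storage of an n × n grid, and the treatment of the outermost points as fixed Dirichlet values are all assumptions made for illustration.

```c
#include <stddef.h>

/* One Jacobi sweep for the model problem (hypothetical helper, not from the text).
   u_old, u_new and rho are flat row-major arrays of n*n doubles; h is the mesh
   spacing. Boundary values are copied unchanged (assumed Dirichlet). */
void jacobi_sweep(const double *u_old, double *u_new, const double *rho,
                  size_t n, double h)
{
    size_t j, l;
    for (j = 0; j < n; j++) {
        for (l = 0; l < n; l++) {
            size_t k = j * n + l;
            if (j == 0 || l == 0 || j == n - 1 || l == n - 1) {
                u_new[k] = u_old[k];              /* boundary point: keep as is */
            } else {
                /* average of the four neighbors, taken from the OLD iterate,
                   plus the source contribution */
                u_new[k] = 0.25 * (u_old[k + n] + u_old[k - n]
                                 + u_old[k + 1] + u_old[k - 1])
                         - 0.25 * h * h * rho[k];
            }
        }
    }
}
```

Calling the function repeatedly, swapping the roles of u_old and u_new each time, and stopping when successive iterates agree to the desired tolerance gives the iteration described above.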

This method is in fact a classical method with origins dating back to the 
last century, called Jacobi’s method (not to be confused with the Jacobi method 
for eigenvalues). The method is not practical because it converges too slowly. 
However, it is the basis for understanding the modern methods, which are always 
compared with it. 

Another classical method is the Gauss-Seidel method, which turns out to be 
important in multigrid methods (§19.6). Here we make use of updated values of u on 
the right-hand side of (19.5.5) as soon as they become available. In other words, the 
averaging is done “in place” instead of being “copied” from an earlier timestep to a 
later one. If we are proceeding along the rows, incrementing j for fixed l, we have 


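$$ u^{n+1}_{j,l} = \frac{1}{4}\left(u^n_{j+1,l} + u^{n+1}_{j-1,l} + u^n_{j,l+1} + u^{n+1}_{j,l-1}\right) - \frac{\Delta^2}{4}\rho_{j,l} \tag{19.5.6} $$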


This method is also slowly converging and only of theoretical interest when used by 
itself, but some analysis of it will be instructive. 
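
The only difference from Jacobi is that the update is done in place, which a short sketch makes plain. The name gauss_seidel_sweep and the flat row-major n × n layout with fixed boundary values are assumptions for illustration; the returned largest change is a convenience for a stopping test, not something the text prescribes.

```c
#include <stddef.h>

/* One in-place Gauss-Seidel sweep for the model problem (hypothetical helper).
   u and rho are flat row-major n x n arrays; boundary values of u stay fixed.
   Returns the largest absolute change made to any interior point. */
double gauss_seidel_sweep(double *u, const double *rho, size_t n, double h)
{
    double max_change = 0.0;
    size_t j, l;
    for (l = 1; l + 1 < n; l++) {          /* fixed l ... */
        for (j = 1; j + 1 < n; j++) {      /* ... incrementing j */
            size_t k = j * n + l;
            /* u[k - n] and u[k - 1] already hold values updated in this sweep */
            double updated = 0.25 * (u[k + n] + u[k - n] + u[k + 1] + u[k - 1])
                           - 0.25 * h * h * rho[k];
            double change = updated > u[k] ? updated - u[k] : u[k] - updated;
            if (change > max_change) max_change = change;
            u[k] = updated;
        }
    }
    return max_change;
}
```

Because no second array is needed, this version also uses half the storage of the Jacobi sketch.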

Let us look at the Jacobi and Gauss-Seidel methods in terms of the matrix 
splitting concept. We change notation and call u “x,” to conform to standard matrix 
notation. To solve 



$$ \mathbf{A} \cdot \mathbf{x} = \mathbf{b} \tag{19.5.7} $$

we can consider splitting $\mathbf{A}$ as

$$ \mathbf{A} = \mathbf{L} + \mathbf{D} + \mathbf{U} \tag{19.5.8} $$

where D is the diagonal part of A, L is the lower triangle of A with zeros on the 
diagonal, and U is the upper triangle of A with zeros on the diagonal. 

In the Jacobi method we write for the rth step of iteration 

$$ \mathbf{D} \cdot \mathbf{x}^{(r)} = -(\mathbf{L} + \mathbf{U}) \cdot \mathbf{x}^{(r-1)} + \mathbf{b} \tag{19.5.9} $$

For our model problem (19.5.5), D is simply the identity matrix. The Jacobi method 
converges for matrices A that are “diagonally dominant” in a sense that can be 
made mathematically precise. For matrices arising from finite differencing, this 
condition is usually met. 

What is the rate of convergence of the Jacobi method? A detailed analysis is
beyond our scope, but here is some of the flavor: The matrix $-\mathbf{D}^{-1} \cdot (\mathbf{L} + \mathbf{U})$ is
the iteration matrix which, apart from an additive term, maps one set of $\mathbf{x}$'s into the
next. The iteration matrix has eigenvalues, each one of which reflects the factor by
which the amplitude of a particular eigenmode of undesired residual is suppressed
during one iteration. Evidently those factors had better all have modulus < 1 for
the relaxation to work at all! The rate of convergence of the method is set by the
rate for the slowest-decaying eigenmode, i.e., the factor with largest modulus. The
modulus of this largest factor, therefore lying between 0 and 1, is called the spectral
radius of the relaxation operator, denoted $\rho_s$.

The number of iterations $r$ required to reduce the overall error by a factor
$10^{-p}$ is thus estimated by

$$ r \approx \frac{p \ln 10}{(-\ln \rho_s)} \tag{19.5.10} $$


In general, the spectral radius $\rho_s$ goes asymptotically to the value 1 as the grid
size $J$ is increased, so that more iterations are required. For any given equation,
grid geometry, and boundary condition, the spectral radius can, in principle, be
computed analytically. For example, for equation (19.5.5) on a $J \times J$ grid with
Dirichlet boundary conditions on all four sides, the asymptotic formula for large
$J$ turns out to be

$$ \rho_s \simeq 1 - \frac{\pi^2}{2J^2} \tag{19.5.11} $$

The number of iterations $r$ required to reduce the error by a factor of $10^{-p}$ is thus

$$ r \simeq \frac{2pJ^2 \ln 10}{\pi^2} \simeq \frac{1}{2} pJ^2 \tag{19.5.12} $$

In other words, the number of iterations is proportional to the number of mesh points,
$J^2$. Since 100 × 100 and larger problems are common, it is clear that the Jacobi
method is only of theoretical interest.

The Gauss-Seidel method, equation (19.5.6), corresponds to the matrix decomposition

$$ (\mathbf{L} + \mathbf{D}) \cdot \mathbf{x}^{(r)} = -\mathbf{U} \cdot \mathbf{x}^{(r-1)} + \mathbf{b} \tag{19.5.13} $$

The fact that $\mathbf{L}$ is on the left-hand side of the equation follows from the updating
in place, as you can easily check if you write out (19.5.13) in components. One
can show [1-3] that the spectral radius is just the square of the spectral radius of the
Jacobi method. For our model problem, therefore,

$$ \rho_s \simeq 1 - \frac{\pi^2}{J^2} \tag{19.5.14} $$

$$ r \simeq \frac{pJ^2 \ln 10}{\pi^2} \simeq \frac{1}{4} pJ^2 \tag{19.5.15} $$

The factor of two improvement in the number of iterations over the Jacobi method
still leaves the method impractical.

Successive Overrelaxation (SOR) 

We get a better algorithm (one that was the standard algorithm until the 1970s)
if we make an overcorrection to the value of $\mathbf{x}^{(r)}$ at the $r$th stage of Gauss-Seidel
iteration, thus anticipating future corrections. Solve (19.5.13) for $\mathbf{x}^{(r)}$, add and
subtract $\mathbf{x}^{(r-1)}$ on the right-hand side, and hence write the Gauss-Seidel method as

$$ \mathbf{x}^{(r)} = \mathbf{x}^{(r-1)} - (\mathbf{L} + \mathbf{D})^{-1} \cdot [(\mathbf{L} + \mathbf{D} + \mathbf{U}) \cdot \mathbf{x}^{(r-1)} - \mathbf{b}] \tag{19.5.16} $$

The term in square brackets is just the residual vector $\boldsymbol{\xi}^{(r-1)}$, so

$$ \mathbf{x}^{(r)} = \mathbf{x}^{(r-1)} - (\mathbf{L} + \mathbf{D})^{-1} \cdot \boldsymbol{\xi}^{(r-1)} \tag{19.5.17} $$

Now overcorrect, defining

$$ \mathbf{x}^{(r)} = \mathbf{x}^{(r-1)} - \omega (\mathbf{L} + \mathbf{D})^{-1} \cdot \boldsymbol{\xi}^{(r-1)} \tag{19.5.18} $$

Here $\omega$ is called the overrelaxation parameter, and the method is called successive
overrelaxation (SOR).

The following theorems can be proved [1-3]: 

• The method is convergent only for $0 < \omega < 2$. If $0 < \omega < 1$, we speak
of underrelaxation.

• Under certain mathematical restrictions generally satisfied by matrices
arising from finite differencing, only overrelaxation ($1 < \omega < 2$) can give
faster convergence than the Gauss-Seidel method.

• If $\rho_{\rm Jacobi}$ is the spectral radius of the Jacobi iteration (so that the square
of it is the spectral radius of the Gauss-Seidel iteration), then the optimal
choice for $\omega$ is given by

$$ \omega = \frac{2}{1 + \sqrt{1 - \rho_{\rm Jacobi}^2}} \tag{19.5.19} $$

• For this optimal choice, the spectral radius for SOR is

$$ \rho_{\rm SOR} = \left( \frac{\rho_{\rm Jacobi}}{1 + \sqrt{1 - \rho_{\rm Jacobi}^2}} \right)^2 \tag{19.5.20} $$


As an application of the above results, consider our model problem for which
$\rho_{\rm Jacobi}$ is given by equation (19.5.11). Then equations (19.5.19) and (19.5.20) give

$$ \omega \simeq \frac{2}{1 + \pi/J} \tag{19.5.21} $$

$$ \rho_{\rm SOR} \simeq 1 - \frac{2\pi}{J} \quad \text{for large } J \tag{19.5.22} $$


Equation (19.5.10) gives for the number of iterations to reduce the initial error by
a factor of $10^{-p}$,

$$ r \simeq \frac{pJ \ln 10}{2\pi} \simeq \frac{1}{3} pJ \tag{19.5.23} $$

Comparing with equation (19.5.12) or (19.5.15), we see that optimal SOR requires
of order $J$ iterations, as opposed to of order $J^2$. Since $J$ is typically 100 or larger,
this makes a tremendous difference! Equation (19.5.23) leads to the mnemonic
that 3-figure accuracy ($p = 3$) requires a number of iterations equal to the number
of mesh points along a side of the grid. For 6-figure accuracy, we require about
twice as many iterations.

How do we choose $\omega$ for a problem for which the answer is not known
analytically? That is just the weak point of SOR! The advantages of SOR obtain
only in a fairly narrow window around the correct value of $\omega$. It is better to take $\omega$
slightly too large, rather than slightly too small, but best to get it right.

One way to choose $\omega$ is to map your problem approximately onto a known
problem, replacing the coefficients in the equation by average values. Note, however,
that the known problem must have the same grid size and boundary conditions as the
actual problem. We give for reference purposes the value of $\rho_{\rm Jacobi}$ for our model
problem on a rectangular $J \times L$ grid, allowing for the possibility that $\Delta x \ne \Delta y$:

$$ \rho_{\rm Jacobi} = \frac{\cos\dfrac{\pi}{J} + \left(\dfrac{\Delta x}{\Delta y}\right)^2 \cos\dfrac{\pi}{L}}{1 + \left(\dfrac{\Delta x}{\Delta y}\right)^2} \tag{19.5.24} $$


Equation (19.5.24) holds for homogeneous Dirichlet or Neumann boundary conditions.
For periodic boundary conditions, make the replacement $\pi \to 2\pi$.

A second way, which is especially useful if you plan to solve many similar
elliptic equations each time with slightly different coefficients, is to determine the
optimum value $\omega$ empirically on the first equation and then use that value for the
remaining equations. Various automated schemes for doing this and for “seeking
out” the best values of $\omega$ are described in the literature.

While the matrix notation introduced earlier is useful for theoretical analyses,
for practical implementation of the SOR algorithm we need explicit formulas.
Consider a general second-order elliptic equation in $x$ and $y$, finite differenced on
a square as for our model equation. Corresponding to each row of the matrix $\mathbf{A}$
is an equation of the form

$$ a_{j,l} u_{j+1,l} + b_{j,l} u_{j-1,l} + c_{j,l} u_{j,l+1} + d_{j,l} u_{j,l-1} + e_{j,l} u_{j,l} = f_{j,l} \tag{19.5.25} $$

For our model equation, we had $a = b = c = d = 1$, $e = -4$. The quantity
$f$ is proportional to the source term. The iterative procedure is defined by solving
(19.5.25) for $u_{j,l}$:

$$ u^*_{j,l} = \frac{1}{e_{j,l}} \left( f_{j,l} - a_{j,l} u_{j+1,l} - b_{j,l} u_{j-1,l} - c_{j,l} u_{j,l+1} - d_{j,l} u_{j,l-1} \right) \tag{19.5.26} $$

Then $u^{\rm new}_{j,l}$ is a weighted average

$$ u^{\rm new}_{j,l} = \omega u^*_{j,l} + (1 - \omega) u^{\rm old}_{j,l} \tag{19.5.27} $$

We calculate it as follows: The residual at any stage is

$$ \xi_{j,l} = a_{j,l} u_{j+1,l} + b_{j,l} u_{j-1,l} + c_{j,l} u_{j,l+1} + d_{j,l} u_{j,l-1} + e_{j,l} u_{j,l} - f_{j,l} \tag{19.5.28} $$

and the SOR algorithm (19.5.18) or (19.5.27) is

$$ u^{\rm new}_{j,l} = u^{\rm old}_{j,l} - \omega \frac{\xi_{j,l}}{e_{j,l}} \tag{19.5.29} $$

This formulation is very easy to program, and the norm of the residual vector $\xi_{j,l}$
can be used as a criterion for terminating the iteration.

Another practical point concerns the order in which mesh points are processed. 
The obvious strategy is simply to proceed in order down the rows (or columns). 
Alternatively, suppose we divide the mesh into “odd” and “even” meshes, like the 
red and black squares of a checkerboard. Then equation (19.5.26) shows that the 
odd points depend only on the even mesh values and vice versa. Accordingly, 
we can carry out one half-sweep updating the odd points, say, and then another 
half-sweep updating the even points with the new odd values. For the version of 
SOR implemented below, we shall adopt odd-even ordering. 

The last practical point is that in practice the asymptotic rate of convergence
in SOR is not attained until of order $J$ iterations. The error often grows by a
factor of 20 before convergence sets in. A trivial modification to SOR resolves this
problem. It is based on the observation that, while $\omega$ is the optimum asymptotic
relaxation parameter, it is not necessarily a good initial choice. In SOR with
Chebyshev acceleration, one uses odd-even ordering and changes $\omega$ at each half-sweep
according to the following prescription:

$$
\begin{aligned}
\omega^{(0)} &= 1 \\
\omega^{(1/2)} &= 1/(1 - \rho_{\rm Jacobi}^2/2) \\
\omega^{(n+1/2)} &= 1/(1 - \rho_{\rm Jacobi}^2\, \omega^{(n)}/4), \qquad n = 1/2, 1, \ldots, \infty \\
\omega^{(\infty)} &\to \omega_{\rm optimal}
\end{aligned}
\tag{19.5.30}
$$

The beauty of Chebyshev acceleration is that the norm of the error always decreases
with each iteration. (This is the norm of the actual error in $u_{j,l}$. The norm of
the residual need not decrease monotonically.) While the asymptotic rate of
convergence is the same as ordinary SOR, there is never any excuse for not using
Chebyshev acceleration to reduce the total number of iterations required.
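
To see how the relaxation parameter evolves under Chebyshev acceleration, it can help to tabulate the sequence by itself. The sketch below fills an array with the value used on each successive half-sweep: 1 on the first, $1/(1 - \rho_{\rm Jacobi}^2/2)$ on the second, and thereafter $1/(1 - \rho_{\rm Jacobi}^2\,\omega/4)$ with $\omega$ the previous value. The name chebyshev_omegas and the array interface are assumptions for illustration.

```c
/* Fill omega[0..count-1] with the relaxation parameters used on successive
   half-sweeps of SOR with Chebyshev acceleration (hypothetical helper):
     omega[0]   = 1
     omega[1]   = 1 / (1 - rho^2 / 2)
     omega[k+1] = 1 / (1 - rho^2 * omega[k] / 4)   for k >= 1
   where rho is the Jacobi spectral radius, 0 <= rho < 1. */
void chebyshev_omegas(double rho_jacobi, double *omega, int count)
{
    double r2 = rho_jacobi * rho_jacobi;
    int k;
    for (k = 0; k < count; k++) {
        if (k == 0)
            omega[k] = 1.0;
        else if (k == 1)
            omega[k] = 1.0 / (1.0 - 0.5 * r2);
        else
            omega[k] = 1.0 / (1.0 - 0.25 * r2 * omega[k - 1]);
    }
}
```

Printing the array for a given Jacobi radius shows the values settling onto the optimal $\omega$, which is a fixed point of the recurrence.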


Here we give a routine for SOR with Chebyshev acceleration. 


#include <math.h>
#define MAXITS 1000
#define EPS 1.0e-5

void sor(double **a, double **b, double **c, double **d, double **e,
    double **f, double **u, int jmax, double rjac)
/* Successive overrelaxation solution of equation (19.5.25) with Chebyshev acceleration. a, b, c,
d, e, and f are input as the coefficients of the equation, each dimensioned to the grid size
[1..jmax][1..jmax]. u is input as the initial guess to the solution, usually zero, and returns
with the final value. rjac is input as the spectral radius of the Jacobi iteration, or an estimate
of it. */
{
    void nrerror(char error_text[]);
    int ipass,j,jsw,l,lsw,n;
    double anorm,anormf=0.0,omega=1.0,resid;
    /* Double precision is a good idea for jmax bigger than about 25. */

    for (j=2;j<jmax;j++)
        /* Compute initial norm of residual and terminate iteration when norm has been
           reduced by a factor EPS. */
        for (l=2;l<jmax;l++)
            anormf += fabs(f[j][l]);    /* Assumes initial u is zero. */
    for (n=1;n<=MAXITS;n++) {
        anorm=0.0;
        jsw=1;
        for (ipass=1;ipass<=2;ipass++) {    /* Odd-even ordering. */
            lsw=jsw;
            for (j=2;j<jmax;j++) {
                for (l=lsw+1;l<jmax;l+=2) {
                    resid=a[j][l]*u[j+1][l]
                        +b[j][l]*u[j-1][l]
                        +c[j][l]*u[j][l+1]
                        +d[j][l]*u[j][l-1]
                        +e[j][l]*u[j][l]
                        -f[j][l];
                    anorm += fabs(resid);
                    u[j][l] -= omega*resid/e[j][l];
                }
                lsw=3-lsw;
            }
            jsw=3-jsw;
            omega=(n == 1 && ipass == 1 ? 1.0/(1.0-0.5*rjac*rjac) :
                1.0/(1.0-0.25*rjac*rjac*omega));
        }
        if (anorm < EPS*anormf) return;
    }
    nrerror("MAXITS exceeded");
}



The main advantage of SOR is that it is very easy to program. Its main
disadvantage is that it is still very inefficient on large problems.

ADI (Alternating-Direction Implicit) Method

The ADI method of §19.3 for diffusion equations can be turned into a relaxation
method for elliptic equations [1-4]. In §19.3, we discussed ADI as a method for
solving the time-dependent heat-flow equation

$$ \frac{\partial u}{\partial t} = \nabla^2 u - \rho \tag{19.5.31} $$

By letting $t \to \infty$ one also gets an iterative method for solving the elliptic equation

$$ \nabla^2 u = \rho \tag{19.5.32} $$



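In either case, the operator splitting is of the form

$$ \mathcal{L} = \mathcal{L}_x + \mathcal{L}_y \tag{19.5.33} $$

where $\mathcal{L}_x$ represents the differencing in $x$ and $\mathcal{L}_y$ that in $y$.

For example, in our model problem (19.0.6) with $\Delta x = \Delta y = \Delta$, we have

$$
\begin{aligned}
\mathcal{L}_x u &= 2u_{j,l} - u_{j+1,l} - u_{j-1,l} \\
\mathcal{L}_y u &= 2u_{j,l} - u_{j,l+1} - u_{j,l-1}
\end{aligned}
\tag{19.5.34}
$$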


More complicated operators may be similarly split, but there is some art involved. 
A bad choice of splitting can lead to an algorithm that fails to converge. Usually 
one tries to base the splitting on the physical nature of the problem. We know for 
our model problem that an initial transient diffuses away, and we set up the x and 
y splitting to mimic diffusion in each dimension. 

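Having chosen a splitting, we difference the time-dependent equation (19.5.31)
implicitly in two half-steps:

$$
\begin{aligned}
\frac{u^{n+1/2} - u^n}{\Delta t/2} &= -\frac{\mathcal{L}_x u^{n+1/2} + \mathcal{L}_y u^n}{\Delta^2} - \rho \\
\frac{u^{n+1} - u^{n+1/2}}{\Delta t/2} &= -\frac{\mathcal{L}_x u^{n+1/2} + \mathcal{L}_y u^{n+1}}{\Delta^2} - \rho
\end{aligned}
\tag{19.5.35}
$$

(cf. equation 19.3.16). Here we have suppressed the spatial indices $(j, l)$. In matrix
notation, equations (19.5.35) are

$$ (\mathbf{L}_x + r\mathbf{1}) \cdot \mathbf{u}^{n+1/2} = (r\mathbf{1} - \mathbf{L}_y) \cdot \mathbf{u}^n - \Delta^2 \boldsymbol{\rho} \tag{19.5.36} $$

$$ (\mathbf{L}_y + r\mathbf{1}) \cdot \mathbf{u}^{n+1} = (r\mathbf{1} - \mathbf{L}_x) \cdot \mathbf{u}^{n+1/2} - \Delta^2 \boldsymbol{\rho} \tag{19.5.37} $$

where

$$ r \equiv \frac{2\Delta^2}{\Delta t} \tag{19.5.38} $$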

The matrices on the left-hand sides of equations (19.5.36) and (19.5.37) are
tridiagonal (and usually positive definite), so the equations can be solved by the
standard tridiagonal algorithm. Given $\mathbf{u}^n$, one solves (19.5.36) for $\mathbf{u}^{n+1/2}$, substitutes
on the right-hand side of (19.5.37), and then solves for $\mathbf{u}^{n+1}$. The key question
is how to choose the iteration parameter $r$, the analog of a choice of timestep for
an initial value problem.
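
Each ADI half-step thus reduces to a set of independent tridiagonal solves, one per grid line. As a reminder of what such a solve involves, here is a minimal sketch of forward elimination and back substitution without pivoting. The name solve_tridiagonal, its argument layout, and the caller-supplied scratch array are assumptions of this sketch, not the interface of any routine in the text.

```c
#include <stddef.h>

/* Solve a tridiagonal system by forward elimination and back substitution.
   In row i, sub[i], diag[i], sup[i] multiply x[i-1], x[i], x[i+1];
   sub[0] and sup[n-1] are ignored. scratch must hold n doubles.
   Returns 0 on success, -1 if a zero pivot is met (no pivoting is done). */
int solve_tridiagonal(size_t n, const double *sub, const double *diag,
                      const double *sup, const double *rhs,
                      double *x, double *scratch)
{
    size_t i;
    double pivot;
    if (n == 0) return -1;
    pivot = diag[0];
    if (pivot == 0.0) return -1;
    x[0] = rhs[0] / pivot;
    for (i = 1; i < n; i++) {                 /* forward elimination */
        scratch[i] = sup[i - 1] / pivot;
        pivot = diag[i] - sub[i] * scratch[i];
        if (pivot == 0.0) return -1;
        x[i] = (rhs[i] - sub[i] * x[i - 1]) / pivot;
    }
    for (i = n - 1; i-- > 0; )                /* back substitution */
        x[i] -= scratch[i + 1] * x[i + 1];
    return 0;
}
```

For the model problem with, say, $r = 2$, the left-hand matrix of (19.5.36) along one grid line has 4 on the diagonal and -1 on the off-diagonals, which is diagonally dominant, so the absence of pivoting does no harm there.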

As usual, the goal is to minimize the spectral radius of the iteration matrix. 
Although it is beyond our scope to go into details here, it turns out that, for the 
optimal choice of r, the ADI method has the same rate of convergence as SOR. 
The individual iteration steps in the ADI method are much more complicated than 
in SOR, so the ADI method would appear to be inferior. This is in fact true if we 
choose the same parameter r for every iteration step. However, it is possible to 
choose a different r for each step. If this is done optimally, then ADI is generally 
more efficient than SOR. We refer you to the literature [1-4] for details.

Our reason for not fully implementing ADI here is that, in most applications, 
it has been superseded by the multigrid methods described in the next section. Our 
advice is to use SOR for trivial problems (e.g., 20 x 20), or for solving a larger 
problem once only, where ease of programming outweighs expense of computer 
time. Occasionally, the sparse matrix methods of §2.7 are useful for solving a set 
of difference equations directly. For production solution of large elliptic problems, 
however, multigrid is now almost always the method of choice. 


CITED REFERENCES AND FURTHER READING: 

Hockney, R.W., and Eastwood, J.W. 1981, Computer Simulation Using Particles (New York:
McGraw-Hill), Chapter 6.

Young, D.M. 1971, Iterative Solution of Large Linear Systems (New York: Academic Press). [1]

Stoer, J., and Bulirsch, R. 1980, Introduction to Numerical Analysis (New York: Springer-Verlag),
§§8.3-8.6. [2]

Varga, R.S. 1962, Matrix Iterative Analysis (Englewood Cliffs, NJ: Prentice-Hall). [3]

Spanier, J. 1967, in Mathematical Methods for Digital Computers, Volume 2 (New York: Wiley),
Chapter 11. [4]


19.6 Multigrid Methods for Boundary Value Problems


Practical multigrid methods were first introduced in the 1970s by Brandt. These 
methods can solve elliptic PDEs discretized on N grid points in O(N) operations. 
The “rapid” direct elliptic solvers discussed in §19.4 solve special kinds of elliptic 
equations in O(N log N) operations. The numerical coefficients in these estimates
are such that multigrid methods are comparable to the rapid methods in execution 
speed. Unlike the rapid methods, however, the multigrid methods can solve general 
elliptic equations with nonconstant coefficients with hardly any loss in efficiency. 
Even nonlinear equations can be solved with comparable speed. 

Unfortunately there is not a single multigrid algorithm that solves all elliptic 
problems. Rather there is a multigrid technique that provides the framework for 
solving these problems. You have to adjust the various components of the algorithm 
within this framework to solve your specific problem. We can only give a brief
introduction to the subject here. In particular, we will give two sample multigrid
routines, one linear and one nonlinear. By following these prototypes and by
perusing the references [1-4], you should be able to develop routines to solve your
own problems.

There are two related, but distinct, approaches to the use of multigrid techniques. 
The first, termed “the multigrid method,” is a means for speeding up the convergence 
of a traditional relaxation method, as defined by you on a grid of pre-specified 
fineness. In this case, you need define your problem (e.g., evaluate its source terms) 
only on this grid. Other, coarser, grids defined by the method can be viewed as 
temporary computational adjuncts. 

The second approach, termed (perhaps confusingly) “the full multigrid (FMG) 
method,” requires you to be able to define your problem on grids of various sizes 
(generally by discretizing the same underlying PDE into different-sized sets of finite- 
difference equations). In this approach, the method obtains successive solutions on 
finer and finer grids. You can stop the solution either at a pre-specified fineness, or 
you can monitor the truncation error due to the discretization, quitting only when 
it is tolerably small. 

In this section we will first discuss the “multigrid method,” then use the concepts 
developed to introduce the FMG method. The latter algorithm is the one that we 
implement in the accompanying programs. 

From One-Grid, through Two-Grid, to Multigrid 

The key idea of the multigrid method can be understood by considering the 
simplest case of a two-grid method. Suppose we are trying to solve the linear 
elliptic problem 


$$ \mathcal{L} u = f \tag{19.6.1} $$

where $\mathcal{L}$ is some linear elliptic operator and $f$ is the source term. Discretize equation
(19.6.1) on a uniform grid with mesh size $h$. Write the resulting set of linear
algebraic equations as

$$ \mathcal{L}_h u_h = f_h \tag{19.6.2} $$

Let $\tilde{u}_h$ denote some approximate solution to equation (19.6.2). We will use the
symbol $u_h$ to denote the exact solution to the difference equations (19.6.2). Then
the error in $\tilde{u}_h$ or the correction is

$$ v_h = u_h - \tilde{u}_h \tag{19.6.3} $$

The residual or defect is

$$ d_h = \mathcal{L}_h \tilde{u}_h - f_h \tag{19.6.4} $$


(Beware: some authors define residual as minus the defect, and there is not universal
agreement about which of these two quantities 19.6.4 defines.) Since $\mathcal{L}_h$ is linear,
the error satisfies

$$ \mathcal{L}_h v_h = -d_h \tag{19.6.5} $$

At this point we need to make an approximation to $\mathcal{L}_h$ in order to find $v_h$. The
classical iteration methods, such as Jacobi or Gauss-Seidel, do this by finding, at
each stage, an approximate solution of the equation

$$ \hat{\mathcal{L}}_h \hat{v}_h = -d_h \tag{19.6.6} $$

where $\hat{\mathcal{L}}_h$ is a “simpler” operator than $\mathcal{L}_h$. For example, $\hat{\mathcal{L}}_h$ is the diagonal part of
$\mathcal{L}_h$ for Jacobi iteration, or the lower triangle for Gauss-Seidel iteration. The next
approximation is generated by

$$ \tilde{u}_h^{\rm new} = \tilde{u}_h + \hat{v}_h \tag{19.6.7} $$

Now consider, as an alternative, a completely different type of approximation
for $\mathcal{L}_h$, one in which we “coarsify” rather than “simplify.” That is, we form some
appropriate approximation $\mathcal{L}_H$ of $\mathcal{L}_h$ on a coarser grid with mesh size $H$ (we will
always take $H = 2h$, but other choices are possible). The residual equation (19.6.5)
is now approximated by

$$ \mathcal{L}_H v_H = -d_H \tag{19.6.8} $$

Since $\mathcal{L}_H$ has smaller dimension, this equation will be easier to solve than equation
(19.6.5). To define the defect $d_H$ on the coarse grid, we need a restriction operator
$\mathcal{R}$ that restricts $d_h$ to the coarse grid:

$$ d_H = \mathcal{R} d_h \tag{19.6.9} $$

The restriction operator is also called the fine-to-coarse operator or the injection
operator. Once we have a solution $\tilde{v}_H$ to equation (19.6.8), we need a prolongation
operator $\mathcal{P}$ that prolongates or interpolates the correction to the fine grid:

$$ \tilde{v}_h = \mathcal{P} \tilde{v}_H \tag{19.6.10} $$

The prolongation operator is also called the coarse-to-fine operator or the interpolation
operator. Both $\mathcal{R}$ and $\mathcal{P}$ are chosen to be linear operators. Finally the
approximation $\tilde{u}_h$ can be updated:

$$ \tilde{u}_h^{\rm new} = \tilde{u}_h + \tilde{v}_h \tag{19.6.11} $$

One step of this coarse-grid correction scheme is thus: 

Coarse-Grid Correction 

• Compute the defect on the fine grid from (19.6.4). 

• Restrict the defect by (19.6.9). 

• Solve (19.6.8) exactly on the coarse grid for the correction. 

• Interpolate the correction to the fine grid by (19.6.10).

• Compute the next approximation by (19.6.11).

Let’s contrast the advantages and disadvantages of relaxation and the coarse-grid
correction scheme. Consider the error $v_h$ expanded into a discrete Fourier series. Call
the components in the lower half of the frequency spectrum the smooth components
and the high-frequency components the nonsmooth components. We have seen that
relaxation becomes very slowly convergent in the limit $h \to 0$, i.e., when there are a
large number of mesh points. The reason turns out to be that the smooth components
are only slightly reduced in amplitude on each iteration. However, many relaxation
methods reduce the amplitude of the nonsmooth components by large factors on
each iteration: They are good smoothing operators.

For the two-grid iteration, on the other hand, components of the error with
wavelengths $\lesssim 2H$ are not even representable on the coarse grid and so cannot be
reduced to zero on this grid. But it is exactly these high-frequency components that
can be reduced by relaxation on the fine grid! This leads us to combine the ideas
of relaxation and coarse-grid correction:

Two-Grid Iteration 

• Pre-smoothing: Compute $\bar{u}_h$ by applying $\nu_1 \ge 0$ steps of a relaxation
method to $\tilde{u}_h$.

• Coarse-grid correction: As above, using $\bar{u}_h$ to give $\bar{u}_h^{\rm new}$.

• Post-smoothing: Compute $\tilde{u}_h^{\rm new}$ by applying $\nu_2 \ge 0$ steps of the relaxation
method to $\bar{u}_h^{\rm new}$.

It is only a short step from the above two-grid method to a multigrid method.
Instead of solving the coarse-grid defect equation (19.6.8) exactly, we can get an
approximate solution of it by introducing an even coarser grid and using the two-grid
iteration method. If the convergence factor of the two-grid method is small enough,
we will need only a few steps of this iteration to get a good enough approximate
solution. We denote the number of such iterations by $\gamma$. Obviously we can apply
this idea recursively down to some coarsest grid. There the solution is found
easily, for example by direct matrix inversion or by iterating the relaxation scheme
to convergence.

One iteration of a multigrid method, from finest grid to coarser grids and back
to finest grid again, is called a cycle. The exact structure of a cycle depends on the
value of $\gamma$, the number of two-grid iterations at each intermediate stage. The case
$\gamma = 1$ is called a V-cycle, while $\gamma = 2$ is called a W-cycle (see Figure 19.6.1). These
are the most important cases in practice.

Note that once more than two grids are involved, the pre-smoothing steps after 
the first one on the finest grid need an initial approximation for the error v. This 
should be taken to be zero. 
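
The recursive structure of a cycle is perhaps easiest to see by tracing which grid is visited when. The following sketch does only that bookkeeping, with no numerical work. The name trace_cycle, the convention that level 1 is the coarsest grid, and the choice to solve exactly once whenever the next grid down is the coarsest are all assumptions of this sketch.

```c
#include <stddef.h>

/* Record the sequence of grid levels visited by one multigrid cycle
   (hypothetical bookkeeping helper). Levels are numbered from 1 (coarsest)
   up to the finest. gamma = 1 traces a V-cycle, gamma = 2 a W-cycle.
   Entries are written to visits[] starting at index count, as long as they
   fit in capacity; the return value is the new number of entries. */
size_t trace_cycle(int level, int gamma, int *visits, size_t count,
                   size_t capacity)
{
    int g, repeats;
    if (level == 1) {                               /* exact solution (E) */
        if (count < capacity) visits[count] = 1;
        return count + 1;
    }
    if (count < capacity) visits[count] = level;    /* pre-smoothing (S) */
    count++;
    /* The coarse-grid problem is solved exactly if the next level is the
       coarsest; otherwise it is approximated by gamma cycles one level down. */
    repeats = (level == 2) ? 1 : gamma;
    for (g = 0; g < repeats; g++)
        count = trace_cycle(level - 1, gamma, visits, count, capacity);
    if (count < capacity) visits[count] = level;    /* post-smoothing (S) */
    return count + 1;
}
```

With three levels, a V-cycle visits levels 3, 2, 1, 2, 3, while a W-cycle visits 3, 2, 1, 2, 2, 1, 2, 3. The repeated 2 in the middle is the post-smoothing of the first inner iteration followed by the pre-smoothing of the second.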

Smoothing, Restriction, and Prolongation Operators 

The most popular smoothing method, and the one you should try first, is 
Gauss-Seidel, since it usually leads to a good convergence rate. If we order the mesh 
points from 1 to N, then the Gauss-Seidel scheme is 



$$ u_i = -\left( \sum_{j=1,\, j \ne i}^{N} L_{ij} u_j - f_i \right) \frac{1}{L_{ii}}, \qquad i = 1, \ldots, N \tag{19.6.12} $$

where new values of $u$ are used on the right-hand side as they become available. The
exact form of the Gauss-Seidel method depends on the ordering chosen for the mesh
points. For typical second-order elliptic equations like our model problem equation
(19.0.3), as differenced in equation (19.0.8), it is usually best to use red-black
ordering, making one pass through the mesh updating the “even” points (like the red
squares of a checkerboard) and another pass updating the “odd” points (the black
squares). When quantities are more strongly coupled along one dimension than
another, one should relax a whole line along that dimension simultaneously. Line
relaxation for nearest-neighbor coupling involves solving a tridiagonal system, and
so is still efficient. Relaxing odd and even lines on successive passes is called zebra
relaxation and is usually preferred over simple line relaxation.

Figure 19.6.1. Structure of multigrid cycles. S denotes smoothing, while E denotes exact solution
on the coarsest grid. Each descending line \ denotes restriction ($\mathcal{R}$) and each ascending line / denotes
prolongation ($\mathcal{P}$). The finest grid is at the top level of each diagram. For the V-cycles ($\gamma = 1$) the E
step is replaced by one 2-grid iteration each time the number of grid levels is increased by one. For the
W-cycles ($\gamma = 2$), each E step gets replaced by two 2-grid iterations.

Note that SOR should not be used as a smoothing operator. The overrelaxation 
destroys the high-frequency smoothing that is so crucial for the multigrid method. 

A succinct notation for the prolongation and restriction operators is to give their
symbol. The symbol of $\mathcal{P}$ is found by considering $v_H$ to be 1 at some mesh point
$(x, y)$, zero elsewhere, and then asking for the values of $\mathcal{P} v_H$. The most popular
prolongation operator is simple bilinear interpolation. It gives nonzero values at the
9 points $(x, y), (x + h, y), \ldots, (x - h, y - h)$, where the values are $1, \frac{1}{2}, \ldots, \frac{1}{4}$.
Its symbol is therefore

$$ \begin{bmatrix} \frac{1}{4} & \frac{1}{2} & \frac{1}{4} \\ \frac{1}{2} & 1 & \frac{1}{2} \\ \frac{1}{4} & \frac{1}{2} & \frac{1}{4} \end{bmatrix} \tag{19.6.13} $$

The symbol of $\mathcal{R}$ is defined by considering $v_h$ to be defined everywhere on the
fine grid, and then asking what is $\mathcal{R} v_h$ at $(x, y)$ as a linear combination of these
values. The simplest possible choice for $\mathcal{R}$ is straight injection, which means simply
filling each coarse-grid point with the value from the corresponding fine-grid point.
Its symbol is “[1].” However, difficulties can arise in practice with this choice. It
turns out that a safe choice for $\mathcal{R}$ is to make it the adjoint operator to $\mathcal{P}$. To define the
adjoint, define the scalar product of two grid functions $u_h$ and $v_h$ for mesh size $h$ as

$$ \langle u_h | v_h \rangle_h \equiv h^2 \sum_{(x,y)} u_h(x, y)\, v_h(x, y) \tag{19.6.14} $$

Then the adjoint of $\mathcal{P}$, denoted $\mathcal{P}^\dagger$, is defined by

$$ \langle u_H | \mathcal{P}^\dagger v_h \rangle_H = \langle \mathcal{P} u_H | v_h \rangle_h \tag{19.6.15} $$

Now take $\mathcal{P}$ to be bilinear interpolation, and choose $u_H = 1$ at $(x, y)$, zero elsewhere.
Set $\mathcal{P}^\dagger = \mathcal{R}$ in (19.6.15) and $H = 2h$. You will find that

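$$ (\mathcal{R} v_h)_{(x,y)} = \frac{1}{4} v_h(x, y) + \frac{1}{8} v_h(x + h, y) + \frac{1}{16} v_h(x + h, y + h) + \cdots \tag{19.6.16} $$

so that the symbol of $\mathcal{R}$ is

$$ \begin{bmatrix} \frac{1}{16} & \frac{1}{8} & \frac{1}{16} \\ \frac{1}{8} & \frac{1}{4} & \frac{1}{8} \\ \frac{1}{16} & \frac{1}{8} & \frac{1}{16} \end{bmatrix} \tag{19.6.17} $$

Note the simple rule: The symbol of $\mathcal{R}$ is $\frac{1}{4}$ the transpose of the matrix defining the
symbol of $\mathcal{P}$, equation (19.6.13). This rule is general whenever $\mathcal{R} = \mathcal{P}^\dagger$ and $H = 2h$.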
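
The two symbols (19.6.13) and (19.6.17) can be turned into code almost mechanically. The sketch below is one way to do it and is not the book's implementation: the function names, the flat row-major storage, the grid relation nf = 2*nc - 1 (coarse point (i, j) coincides with fine point (2i, 2j)), and the use of straight injection on the boundary are all assumptions made for illustration.

```c
#include <stddef.h>

/* Full-weighting restriction with weights 1/4 (center), 1/8 (edge neighbors)
   and 1/16 (corner neighbors). fine is nf x nf and coarse is nc x nc with
   nf = 2*nc - 1. Boundary points are filled by straight injection. */
void restrict_full_weighting(const double *fine, double *coarse, size_t nc)
{
    size_t nf = 2 * nc - 1, ic, jc;
    for (ic = 0; ic < nc; ic++) {
        for (jc = 0; jc < nc; jc++) {
            size_t k = (2 * ic) * nf + 2 * jc;   /* coincident fine point */
            if (ic == 0 || jc == 0 || ic == nc - 1 || jc == nc - 1) {
                coarse[ic * nc + jc] = fine[k];
            } else {
                coarse[ic * nc + jc] =
                    0.25 * fine[k]
                  + 0.125 * (fine[k + 1] + fine[k - 1]
                           + fine[k + nf] + fine[k - nf])
                  + 0.0625 * (fine[k + nf + 1] + fine[k + nf - 1]
                            + fine[k - nf + 1] + fine[k - nf - 1]);
            }
        }
    }
}

/* Bilinear prolongation: coincident points are copied, points midway along a
   grid line get the mean of 2 coarse values, cell centers the mean of 4. */
void prolong_bilinear(const double *coarse, double *fine, size_t nc)
{
    size_t nf = 2 * nc - 1, i, j;
    for (i = 0; i < nf; i++) {
        for (j = 0; j < nf; j++) {
            size_t i0 = i / 2, j0 = j / 2;       /* lower coarse neighbor */
            size_t i1 = i0 + (i % 2);            /* upper; same if i is even */
            size_t j1 = j0 + (j % 2);
            fine[i * nf + j] = 0.25 * (coarse[i0 * nc + j0] + coarse[i0 * nc + j1]
                                     + coarse[i1 * nc + j0] + coarse[i1 * nc + j1]);
        }
    }
}
```

Prolonging a coarse-grid function that is 1 at a single interior point and 0 elsewhere reproduces the weights 1, 1/2, 1/4 of the prolongation symbol on the fine grid, which is a convenient check on the indexing.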

The particular choice of $\mathcal{R}$ in (19.6.17) is called full weighting. Another popular
choice for $\mathcal{R}$ is half weighting, “halfway” between full weighting and straight
injection. Its symbol is

$$ \begin{bmatrix} 0 & \frac{1}{8} & 0 \\ \frac{1}{8} & \frac{1}{2} & \frac{1}{8} \\ 0 & \frac{1}{8} & 0 \end{bmatrix} \tag{19.6.18} $$

A similar notation can be used to describe the difference operator $\mathcal{L}_h$. For
example, the standard differencing of the model problem, equation (19.0.6), is
represented by the five-point difference star

$$ \mathcal{L}_h = \frac{1}{h^2} \begin{bmatrix} 0 & 1 & 0 \\ 1 & -4 & 1 \\ 0 & 1 & 0 \end{bmatrix} \tag{19.6.19} $$

If you are confronted with a new problem and you are not sure what $\mathcal{P}$ and $\mathcal{R}$
choices are likely to work well, here is a safe rule: Suppose $m_p$ is the order of the
interpolation $\mathcal{P}$ (i.e., it interpolates polynomials of degree $m_p - 1$ exactly). Suppose
$m_r$ is the order of $\mathcal{R}$, and that $\mathcal{R}$ is the adjoint of some $\mathcal{P}$ (not necessarily the $\mathcal{P}$ you
intend to use). Then if $m$ is the order of the differential operator $\mathcal{L}_h$, you should
satisfy the inequality $m_p + m_r > m$. For example, bilinear interpolation and its
adjoint, full weighting, for Poisson’s equation satisfy $m_p + m_r = 4 > m = 2$.

Of course the $\mathcal{P}$ and $\mathcal{R}$ operators should enforce the boundary conditions for
your problem. The easiest way to do this is to rewrite the difference equation to
have homogeneous boundary conditions by modifying the source term if necessary
(cf. §19.4). Enforcing homogeneous boundary conditions simply requires the $\mathcal{P}$
operator to produce zeros at the appropriate boundary points. The corresponding
$\mathcal{R}$ is then found by $\mathcal{R} = \mathcal{P}^\dagger$.

Full Multigrid Algorithm 

So far we have described multigrid as an iterative scheme, where one starts 
with some initial guess on the finest grid and carries out enough cycles (V-cycles, 
W-cycles,...) to achieve convergence. This is the simplest way to use multigrid: 
Simply apply enough cycles until some appropriate convergence criterion is met. 
However, efficiency can be improved by using the Full Multigrid Algorithm (FMG), 
also known as nested iteration. 

Instead of starting with an arbitrary approximation on the finest grid (e.g.,
$u_h = 0$), the first approximation is obtained by interpolating from a coarse-grid
solution:

$$ u_h = \mathcal{P} u_H \tag{19.6.20} $$

The coarse-grid solution itself is found by a similar FMG process from even coarser 
grids. At the coarsest level, you start with the exact solution. Rather than proceed as 
in Figure 19.6.1, then, FMG gets to its solution by a series of increasingly tall “N’s,” 
each taller one probing a finer grid (see Figure 19.6.2). 

Note that $\mathcal{P}$ in (19.6.20) need not be the same $\mathcal{P}$ used in the multigrid cycles.
It should be at least of the same order as the discretization $\mathcal{L}_h$, but sometimes a
higher-order operator leads to greater efficiency.

It turns out that you usually need one or at most two multigrid cycles at each 
level before proceeding down to the next finer grid. While there is theoretical 
guidance on the required number of cycles (e.g., [2]), you can easily determine it 
empirically. Fix the finest level and study the solution values as you increase the 
number of cycles per level. The asymptotic value of the solution is the exact solution 
of the difference equations. The difference between this exact solution and the 
solution for a small number of cycles is the iteration error. Now fix the number of 
cycles to be large, and vary the number of levels, i.e., the smallest value of h used. In 
this way you can estimate the truncation error for a given h. In your final production 
code, there is no point in using more cycles than you need to get the iteration error 
down to the size of the truncation error. 

The simple multigrid iteration (cycle) needs the right-hand side $f$ only at the
finest level. FMG needs $f$ at all levels. If the boundary conditions are homogeneous,
you can use $f_H = \mathcal{R} f_h$. This prescription is not always safe for inhomogeneous
boundary conditions. In that case it is better to discretize $f$ on each coarse grid.

Note that the FMG algorithm produces the solution on all levels. It can therefore 
be combined with techniques like Richardson extrapolation. 
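As a minimal sketch of that combination, suppose the discretization is second-order accurate, so that the error on each grid behaves like $h^2$ times a smooth function (an assumption that must hold for your problem). Then the fine and coarse solutions can be combined at the points the two grids share. The routine name richardson2 and the flat, 0-based, row-major storage are choices made for this illustration only.

```c
/* Richardson extrapolation for a second-order scheme.
   uf:  fine-grid solution, (2*nc-1) x (2*nc-1), spacing h
   uc:  coarse-grid solution, nc x nc, spacing 2h
   out: extrapolated values on the coarse-grid points, nc x nc
   All arrays are flat, 0-based, row-major (illustrative layout). */
void richardson2(const double *uf, const double *uc, double *out, int nc)
{
    int nf = 2*nc - 1, ic, jc;

    for (ic = 0; ic < nc; ic++)
        for (jc = 0; jc < nc; jc++)
            /* (4*u_h - u_2h)/3 cancels the leading h^2 error term */
            out[ic*nc + jc] =
                (4.0*uf[(2*ic)*nf + 2*jc] - uc[ic*nc + jc]) / 3.0;
}
```

The weights 4/3 and -1/3 follow from requiring the $h^2$ terms of the two solutions to cancel; a scheme of different order needs different weights.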

We now give a routine mglin that implements the Full Multigrid Algorithm
for a linear equation, the model problem (19.0.6). It uses red-black Gauss-Seidel as
the smoothing operator, bilinear interpolation for $\mathcal{P}$, and half-weighting for $\mathcal{R}$. To
change the routine to handle another linear problem, all you need do is modify the
functions relax, resid, and slvsml appropriately. A feature of the routine is the
dynamical allocation of storage for variables defined on the various grids.



#include "nrutil.h"
#define NPRE 1      /* Number of relaxation sweeps before ... */
#define NPOST 1     /* ... and after the coarse-grid correction is computed. */
#define NGMAX 15

void mglin(double **u, int n, int ncycle)
/* Full Multigrid Algorithm for solution of linear elliptic equation, here the model problem (19.0.6).
   On input u[1..n][1..n] contains the right-hand side rho, while on output it returns the solution.
   The dimension n must be of the form 2^j + 1 for some integer j. (j is actually the number of
   grid levels used in the solution, called ng below.) ncycle is the number of V-cycles to be
   used at each level. */
{
    void addint(double **uf, double **uc, double **res, int nf);
    void copy(double **aout, double **ain, int n);
    void fill0(double **u, int n);
    void interp(double **uf, double **uc, int nf);
    void relax(double **u, double **rhs, int n);
    void resid(double **res, double **u, double **rhs, int n);
    void rstrct(double **uc, double **uf, int nc);
    void slvsml(double **u, double **rhs);
    unsigned int j,jcycle,jj,jpost,jpre,nf,ng=0,ngrid,nn;
    double **ires[NGMAX+1],**irho[NGMAX+1],**irhs[NGMAX+1],**iu[NGMAX+1];

    nn=n;
    while (nn >>= 1) ng++;
    if (n != 1+(1L << ng)) nrerror("n-1 must be a power of 2 in mglin.");
    if (ng > NGMAX) nrerror("increase NGMAX in mglin.");
    nn=n/2+1;
    ngrid=ng-1;
    irho[ngrid]=dmatrix(1,nn,1,nn);         /* Allocate storage for r.h.s. on grid ng-1, */
    rstrct(irho[ngrid],u,nn);               /* and fill it by restricting from the fine grid. */
    while (nn > 3) {                        /* Similarly allocate storage and fill r.h.s. on all */
        nn=nn/2+1;                          /* coarse grids. */
        irho[--ngrid]=dmatrix(1,nn,1,nn);
        rstrct(irho[ngrid],irho[ngrid+1],nn);
    }
    nn=3;
    iu[1]=dmatrix(1,nn,1,nn);
    irhs[1]=dmatrix(1,nn,1,nn);
    slvsml(iu[1],irho[1]);                  /* Initial solution on coarsest grid. */
    free_dmatrix(irho[1],1,nn,1,nn);
    ngrid=ng;
    for (j=2;j<=ngrid;j++) {                /* Nested iteration loop. */
        nn=2*nn-1;
        iu[j]=dmatrix(1,nn,1,nn);
        irhs[j]=dmatrix(1,nn,1,nn);
        ires[j]=dmatrix(1,nn,1,nn);
        interp(iu[j],iu[j-1],nn);           /* Interpolate from coarse grid to next finer grid. */
        copy(irhs[j],(j != ngrid ? irho[j] : u),nn);    /* Set up r.h.s. */
        for (jcycle=1;jcycle<=ncycle;jcycle++) {        /* V-cycle loop. */
            nf=nn;
            for (jj=j;jj>=2;jj--) {         /* Downward stroke of the V. */
                for (jpre=1;jpre<=NPRE;jpre++)          /* Pre-smoothing. */
                    relax(iu[jj],irhs[jj],nf);
                resid(ires[jj],iu[jj],irhs[jj],nf);
                nf=nf/2+1;
                rstrct(irhs[jj-1],ires[jj],nf);         /* Restriction of the residual is the next r.h.s. */
                fill0(iu[jj-1],nf);         /* Zero for initial guess in next relaxation. */
            }
            slvsml(iu[1],irhs[1]);          /* Bottom of V: solve on coarsest grid. */
            nf=3;
            for (jj=2;jj<=j;jj++) {         /* Upward stroke of V. */
                nf=2*nf-1;
                addint(iu[jj],iu[jj-1],ires[jj],nf);    /* Use res for temporary storage inside addint. */
                for (jpost=1;jpost<=NPOST;jpost++)      /* Post-smoothing. */
                    relax(iu[jj],irhs[jj],nf);
            }
        }
    }
    copy(u,iu[ngrid],n);                    /* Return solution in u. */
    for (nn=n,j=ng;j>=2;j--,nn=nn/2+1) {
        free_dmatrix(ires[j],1,nn,1,nn);
        free_dmatrix(irhs[j],1,nn,1,nn);
        free_dmatrix(iu[j],1,nn,1,nn);
        if (j != ng) free_dmatrix(irho[j],1,nn,1,nn);
    }
    free_dmatrix(irhs[1],1,3,1,3);
    free_dmatrix(iu[1],1,3,1,3);
}


void rstrct(double **uc, double **uf, int nc)
/* Half-weighting restriction. nc is the coarse-grid dimension. The fine-grid solution is input in
   uf[1..2*nc-1][1..2*nc-1], the coarse-grid solution is returned in uc[1..nc][1..nc]. */
{
    int ic,iif,jc,jf,ncc=2*nc-1;

    for (jf=3,jc=2;jc<nc;jc++,jf+=2) {      /* Interior points. */
        for (iif=3,ic=2;ic<nc;ic++,iif+=2) {
            uc[ic][jc]=0.5*uf[iif][jf]+0.125*(uf[iif+1][jf]+uf[iif-1][jf]
                +uf[iif][jf+1]+uf[iif][jf-1]);
        }
    }
    for (jc=1,ic=1;ic<=nc;ic++,jc+=2) {     /* Boundary points. */
        uc[ic][1]=uf[jc][1];
        uc[ic][nc]=uf[jc][ncc];
    }
    for (jc=1,ic=1;ic<=nc;ic++,jc+=2) {
        uc[1][ic]=uf[1][jc];
        uc[nc][ic]=uf[ncc][jc];
    }
}


void interp(double **uf, double **uc, int nf)
/* Coarse-to-fine prolongation by bilinear interpolation. nf is the fine-grid dimension. The coarse-
   grid solution is input as uc[1..nc][1..nc], where nc = nf/2 + 1. The fine-grid solution is
   returned in uf[1..nf][1..nf]. */
{
    int ic,iif,jc,jf,nc;

    nc=nf/2+1;
    for (jc=1,jf=1;jc<=nc;jc++,jf+=2)       /* Do elements that are copies. */
        for (ic=1;ic<=nc;ic++) uf[2*ic-1][jf]=uc[ic][jc];
    for (jf=1;jf<=nf;jf+=2)                 /* Do odd-numbered columns, interpolating vertically. */
        for (iif=2;iif<nf;iif+=2)
            uf[iif][jf]=0.5*(uf[iif+1][jf]+uf[iif-1][jf]);
    for (jf=2;jf<nf;jf+=2)                  /* Do even-numbered columns, interpolating horizontally. */
        for (iif=1;iif<=nf;iif++)
            uf[iif][jf]=0.5*(uf[iif][jf+1]+uf[iif][jf-1]);
}


void addint(double **uf, double **uc, double **res, int nf)
/* Does coarse-to-fine interpolation and adds result to uf. nf is the fine-grid dimension. The
   coarse-grid solution is input as uc[1..nc][1..nc], where nc = nf/2 + 1. The fine-grid solution
   is returned in uf[1..nf][1..nf]. res[1..nf][1..nf] is used for temporary storage. */
{
    void interp(double **uf, double **uc, int nf);
    int i,j;

    interp(res,uc,nf);
    for (j=1;j<=nf;j++)
        for (i=1;i<=nf;i++)
            uf[i][j] += res[i][j];
}


void slvsml(double **u, double **rhs)
/* Solution of the model problem on the coarsest grid, where h = 1/2. The right-hand side is input
   in rhs[1..3][1..3] and the solution is returned in u[1..3][1..3]. */
{
    void fill0(double **u, int n);
    double h=0.5;

    fill0(u,3);
    u[2][2] = -h*h*rhs[2][2]/4.0;
}


void relax(double **u, double **rhs, int n)
/* Red-black Gauss-Seidel relaxation for model problem. Updates the current value of the solution
   u[1..n][1..n], using the right-hand side function rhs[1..n][1..n]. */
{
    int i,ipass,isw,j,jsw=1;
    double h,h2;

    h=1.0/(n-1);
    h2=h*h;
    for (ipass=1;ipass<=2;ipass++,jsw=3-jsw) {      /* Red and black sweeps. */
        isw=jsw;
        for (j=2;j<n;j++,isw=3-isw)
            for (i=isw+1;i<n;i+=2)                  /* Gauss-Seidel formula. */
                u[i][j]=0.25*(u[i+1][j]+u[i-1][j]+u[i][j+1]
                    +u[i][j-1]-h2*rhs[i][j]);
    }
}


void resid(double **res, double **u, double **rhs, int n)
/* Returns minus the residual for the model problem. Input quantities are u[1..n][1..n] and
   rhs[1..n][1..n], while res[1..n][1..n] is returned. */
{
    int i,j;
    double h,h2i;

    h=1.0/(n-1);
    h2i=1.0/(h*h);
    for (j=2;j<n;j++)                       /* Interior points. */
        for (i=2;i<n;i++)
            res[i][j] = -h2i*(u[i+1][j]+u[i-1][j]+u[i][j+1]+u[i][j-1]-
                4.0*u[i][j])+rhs[i][j];
    for (i=1;i<=n;i++)                      /* Boundary points. */
        res[i][1]=res[i][n]=res[1][i]=res[n][i]=0.0;
}


void copy(double **aout, double **ain, int n)
/* Copies ain[1..n][1..n] to aout[1..n][1..n]. */
{
    int i,j;

    for (i=1;i<=n;i++)
        for (j=1;j<=n;j++)
            aout[j][i]=ain[j][i];
}


void fill0(double **u, int n)
/* Fills u[1..n][1..n] with zeros. */
{
    int i,j;

    for (j=1;j<=n;j++)
        for (i=1;i<=n;i++)
            u[i][j]=0.0;
}



The routine mglin is written for clarity, not maximum efficiency, so that it is
easy to modify. Several simple changes will speed up the execution time:

• The defect $d_h$ vanishes identically at all black mesh points after a red-black
Gauss-Seidel step. Thus $d_H = \mathcal{R} d_h$ for half-weighting reduces to simply
copying half the defect from the fine grid to the corresponding coarse-grid
point. The calls to resid followed by rstrct in the first part of the
V-cycle can be replaced by a routine that loops only over the coarse grid,
filling it with half the defect.

• Similarly, the quantity $\tilde u_h^{\rm new} = \tilde u_h + \mathcal{P}\tilde v_H$ need not be computed at red
mesh points, since they will immediately be redefined in the subsequent
Gauss-Seidel sweep. This means that addint need only loop over black
points.

• You can speed up relax in several ways. First, you can have a special
form when the initial guess is zero, and omit the routine fill0. Next,
you can store $h^2 f_h$ on the various grids and save a multiplication. Finally,
it is possible to save an addition in the Gauss-Seidel formula by rewriting
it with intermediate variables.

• On typical problems, mglin with ncycle = 1 will return a solution with
the iteration error bigger than the truncation error for the given size of h.
To knock the error down to the size of the truncation error, you have to
set ncycle = 2 or, more cheaply, NPRE = 2. A more efficient way turns
out to be to use a higher-order $\mathcal{P}$ in (19.6.20) than the linear interpolation
used in the V-cycle.

Implementing all the above features typically gives up to a factor of two 
improvement in execution time and is certainly worthwhile in a production code. 
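The first of these speed-ups can be sketched as follows. This is an illustration under stated assumptions, not a replacement routine from the book: the name half_defect, the IDX macro, and the flat 0-based storage are invented here, and the shortcut is valid only when called immediately after a complete red-black sweep in which the coarse-grid points belong to the colour relaxed first (as they do in relax above), so that the defect at their four neighbours is zero.

```c
#define IDX(i,j,n) ((i)*(n)+(j))    /* 0-based, row-major (illustrative layout) */

/* Fused residual and half-weighting restriction for the model problem.
   u, rhs: fine grid, (2*nc-1) x (2*nc-1);  rc: coarse grid, nc x nc.
   Follows the sign convention of resid (minus the residual).
   Valid only right after a full red-black Gauss-Seidel sweep. */
void half_defect(double *rc, const double *u, const double *rhs, int nc)
{
    int nf = 2*nc - 1, ic, jc, i, j;
    double h = 1.0/(nf - 1), h2i = 1.0/(h*h);

    for (ic = 0; ic < nc; ic++)
        for (jc = 0; jc < nc; jc++) {
            i = 2*ic;                       /* coincident fine-grid point */
            j = 2*jc;
            if (ic == 0 || jc == 0 || ic == nc-1 || jc == nc-1)
                rc[IDX(ic,jc,nc)] = 0.0;    /* boundary: zero, as resid + rstrct give */
            else
                rc[IDX(ic,jc,nc)] = 0.5*(rhs[IDX(i,j,nf)]
                    - h2i*(u[IDX(i+1,j,nf)] + u[IDX(i-1,j,nf)]
                         + u[IDX(i,j+1,nf)] + u[IDX(i,j-1,nf)]
                         - 4.0*u[IDX(i,j,nf)]));
        }
}
```

The loop touches only coarse-grid points, so it does roughly a quarter of the work of computing the residual everywhere and then restricting it.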

Nonlinear Multigrid: The FAS Algorithm 

Now turn to solving a nonlinear elliptic equation, which we write symbolically as

$\mathcal{L}(u) = 0$ (19.6.21)

Any explicit source term has been moved to the left-hand side. Suppose equation (19.6.21)
is suitably discretized:

$\mathcal{L}_h(u_h) = 0$ (19.6.22)

We will see below that in the multigrid algorithm we will have to consider equations where a
nonzero right-hand side is generated during the course of the solution:

$\mathcal{L}_h(u_h) = f_h$ (19.6.23)

One way of solving nonlinear problems with multigrid is to use Newton’s method, which 
produces linear equations for the correction term at each iteration. We can then use linear 
multigrid to solve these equations. A great strength of the multigrid idea, however, is that it 
can be applied directly to nonlinear problems. All we need is a suitable nonlinear relaxation 
method to smooth the errors, plus a procedure for approximating corrections on coarser grids. 
This direct approach is Brandt’s Full Approximation Storage Algorithm (FAS). No nonlinear 
equations need be solved, except perhaps on the coarsest grid. 

To develop the nonlinear algorithm, suppose we have a relaxation procedure that can
smooth the residual vector as we did in the linear case. Then we can seek a smooth correction
$v_h$ to the current approximation $\tilde u_h$ that solves (19.6.23):

$\mathcal{L}_h(\tilde u_h + v_h) = f_h$ (19.6.24)

To find $v_h$, note that

$\mathcal{L}_h(\tilde u_h + v_h) - \mathcal{L}_h(\tilde u_h) = f_h - \mathcal{L}_h(\tilde u_h) = -d_h$ (19.6.25)

The right-hand side is smooth after a few nonlinear relaxation sweeps. Thus we can transfer
the left-hand side to a coarse grid:

$\mathcal{L}_H(u_H) - \mathcal{L}_H(\mathcal{R}\tilde u_h) = -\mathcal{R} d_h$ (19.6.26)

that is, we solve

$\mathcal{L}_H(u_H) = \mathcal{L}_H(\mathcal{R}\tilde u_h) - \mathcal{R} d_h$ (19.6.27)

on the coarse grid. (This is how nonzero right-hand sides appear.) Suppose the approximate
solution is $\tilde u_H$. Then the coarse-grid correction is

$\tilde v_H = \tilde u_H - \mathcal{R}\tilde u_h$ (19.6.28)

and

$\tilde u_h^{\rm new} = \tilde u_h + \mathcal{P}(\tilde u_H - \mathcal{R}\tilde u_h)$ (19.6.29)

Note that $\mathcal{P}\mathcal{R} \neq 1$ in general, so $\tilde u_h^{\rm new} \neq \mathcal{P}\tilde u_H$. This is a key point: In equation (19.6.29) the
interpolation error comes only from the correction, not from the full solution $\tilde u_H$.

Equation (19.6.27) shows that one is solving for the full approximation $u_H$, not just the
error as in the linear algorithm. This is the origin of the name FAS.

The FAS multigrid algorithm thus looks very similar to the linear multigrid algorithm.
The only differences are that both the defect $d_h$ and the relaxed approximation $\tilde u_h$ have to
be restricted to the coarse grid, where now it is equation (19.6.27) that is solved by recursive
invocation of the algorithm. However, instead of implementing the algorithm this way, we
will first describe the so-called dual viewpoint, which leads to a powerful alternative way
of looking at the multigrid idea.

The dual viewpoint considers the local truncation error, defined as

$\tau \equiv \mathcal{L}_h(u) - f_h$ (19.6.30)

where $u$ is the exact solution of the original continuum equation. If we rewrite this as

$\mathcal{L}_h(u) = f_h + \tau$ (19.6.31)

we see that $\tau$ can be regarded as the correction to $f_h$ so that the solution of the fine-grid
equation will be the exact solution $u$.

Now consider the relative truncation error $\tau_h$, which is defined on the $H$-grid relative
to the $h$-grid:

$\tau_h \equiv \mathcal{L}_H(\mathcal{R} u_h) - \mathcal{R}\mathcal{L}_h(u_h)$ (19.6.32)

Since $\mathcal{L}_h(u_h) = f_h$, this can be rewritten as

$\mathcal{L}_H(u_H) = f_H + \tau_h$ (19.6.33)

In other words, we can think of $\tau_h$ as the correction to $f_H$ that makes the solution of the
coarse-grid equation equal to the fine-grid solution. Of course we cannot compute $\tau_h$, but we
do have an approximation to it from using $\tilde u_h$ in equation (19.6.32):

$\tau_h \simeq \tilde\tau_h \equiv \mathcal{L}_H(\mathcal{R}\tilde u_h) - \mathcal{R}\mathcal{L}_h(\tilde u_h)$ (19.6.34)

Replacing $\tau_h$ by $\tilde\tau_h$ in equation (19.6.33) gives

$\mathcal{L}_H(u_H) = \mathcal{L}_H(\mathcal{R}\tilde u_h) - \mathcal{R} d_h$ (19.6.35)

which is just the coarse-grid equation (19.6.27)!
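The substitution behind that last step can be written out explicitly. The lines below are a sketch that assumes the coarse right-hand side is obtained by restriction, $f_H = \mathcal{R} f_h$, and uses the defect $d_h = \mathcal{L}_h(\tilde u_h) - f_h$ from equation (19.6.25); a tilde marks an approximate (relaxed) solution.

```latex
\begin{align*}
\mathcal{L}_H(u_H) &= f_H + \tilde\tau_h \\
  &= \mathcal{R} f_h + \mathcal{L}_H(\mathcal{R}\tilde u_h)
     - \mathcal{R}\mathcal{L}_h(\tilde u_h) \\
  &= \mathcal{L}_H(\mathcal{R}\tilde u_h)
     - \mathcal{R}\bigl[\mathcal{L}_h(\tilde u_h) - f_h\bigr] \\
  &= \mathcal{L}_H(\mathcal{R}\tilde u_h) - \mathcal{R} d_h .
\end{align*}
```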

Thus we see that there are two complementary viewpoints for the relation between 
coarse and fine grids: 

• Coarse grids are used to accelerate the convergence of the smooth components
of the fine-grid residuals.

• Fine grids are used to compute correction terms to the coarse-grid equations,
yielding fine-grid accuracy on the coarse grids.

One benefit of this new viewpoint is that it allows us to derive a natural stopping criterion
for a multigrid iteration. Normally the criterion would be

$\|d_h\| \le \epsilon$ (19.6.36)

and the question is how to choose $\epsilon$. There is clearly no benefit in iterating beyond the
point when the remaining error is dominated by the local truncation error $\tau$. The computable
quantity is $\tilde\tau_h$. What is the relation between $\tau$ and $\tilde\tau_h$? For the typical case of a second-order
accurate differencing scheme,

$\tau = \mathcal{L}_h(u) - \mathcal{L}_h(u_h) = h^2 \tau_2(x,y) + \cdots$ (19.6.37)

Assume the solution satisfies $u_h = u + h^2 u_2(x,y) + \cdots$. Then, assuming $\mathcal{R}$ is of high
enough order that we can neglect its effect, equation (19.6.32) gives

$\tau_h \simeq \mathcal{L}_H(u + h^2 u_2) - \mathcal{L}_h(u + h^2 u_2)$
$\quad = \mathcal{L}_H(u) - \mathcal{L}_h(u) + h^2[\mathcal{L}'_H(u_2) - \mathcal{L}'_h(u_2)] + \cdots$ (19.6.38)
$\quad = (H^2 - h^2)\tau_2 + O(h^4)$

For the usual case of $H = 2h$ we therefore have

$\tau \simeq \frac{1}{3}\tau_h \simeq \frac{1}{3}\tilde\tau_h$ (19.6.39)

The stopping criterion is thus equation (19.6.36) with

$\epsilon = \alpha\|\tilde\tau_h\|, \qquad \alpha \sim \frac{1}{3}$ (19.6.40)
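The factor of one third can be checked in two lines from the leading-order terms of equations (19.6.37) and (19.6.38); the calculation below keeps only those terms and takes $H = 2h$.

```latex
\begin{align*}
\tau_h &\simeq (H^2 - h^2)\,\tau_2 = (4h^2 - h^2)\,\tau_2 = 3h^2\,\tau_2 ,\\
\tau   &\simeq h^2\,\tau_2 \simeq \tfrac{1}{3}\,\tau_h .
\end{align*}
```

For a coarsening ratio other than 2, or a scheme that is not second-order accurate, the same argument gives a different value of $\alpha$.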

We have one remaining task before implementing our nonlinear multigrid algorithm:
choosing a nonlinear relaxation scheme. Once again, your first choice should probably be
the nonlinear Gauss-Seidel scheme. If the discretized equation (19.6.23) is written with
some choice of ordering as

$L_i(u_1, \ldots, u_N) = f_i, \qquad i = 1, \ldots, N$ (19.6.41)

then the nonlinear Gauss-Seidel scheme solves

$L_i(u_1, \ldots, u_{i-1}, u_i^{\rm new}, u_{i+1}, \ldots, u_N) = f_i$ (19.6.42)

for $u_i^{\rm new}$. As usual new $u$'s replace old $u$'s as soon as they have been computed. Often equation
(19.6.42) is linear in $u_i^{\rm new}$, since the nonlinear terms are discretized by means of its neighbors.
If this is not the case, we replace equation (19.6.42) by one step of a Newton iteration:

$u_i^{\rm new} = u_i^{\rm old} - \dfrac{L_i(u_i^{\rm old}) - f_i}{\partial L_i(u_i^{\rm old})/\partial u_i}$ (19.6.43)

For example, consider the simple nonlinear equation

$\nabla^2 u + u^2 = \rho$ (19.6.44)

In two-dimensional notation, we have

$\mathcal{L}(u_{i,j}) = (u_{i+1,j} + u_{i-1,j} + u_{i,j+1} + u_{i,j-1} - 4u_{i,j})/h^2 + u_{i,j}^2 - \rho_{i,j} = 0$ (19.6.45)

Since

$\dfrac{\partial\mathcal{L}}{\partial u_{i,j}} = -4/h^2 + 2u_{i,j}$ (19.6.46)

the Newton Gauss-Seidel iteration is

$u_{i,j}^{\rm new} = u_{i,j} - \dfrac{\mathcal{L}(u_{i,j})}{-4/h^2 + 2u_{i,j}}$ (19.6.47)
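A single Newton Gauss-Seidel update at one mesh point can be isolated as a small function. The sketch below is for illustration: the name newton_gs_point and the choice to pass the sum of the four neighbours as one argument are assumptions made here, not part of the routines that follow.

```c
/* One Newton Gauss-Seidel update at a single interior point for
   (nbsum - 4u)/h^2 + u^2 = rho, where nbsum is the sum of the four
   neighbouring values. Hypothetical helper for illustration. */
double newton_gs_point(double u, double nbsum, double rho, double h)
{
    double h2i = 1.0/(h*h);
    double res = h2i*(nbsum - 4.0*u) + u*u - rho;   /* L(u_ij), eq. (19.6.45) */

    return u - res/(-4.0*h2i + 2.0*u);              /* divide by dL/du_ij, eq. (19.6.46) */
}
```

On the coarsest 3 x 3 grid with zero boundary values there is a single unknown and nbsum = 0, so repeating the update from u = 0 is just Newton's method for a scalar quadratic; with rho = 1 and h = 1/2 it settles near u = 8 - sqrt(65), about -0.06226, within a few iterations.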


Here is a routine mgfas that solves equation (19.6.44) using the Full Multigrid Algorithm
and the FAS scheme. Restriction and prolongation are done as in mglin. We have included
the convergence test based on equation (19.6.40). A successful multigrid solution of a problem
should aim to satisfy this condition with the maximum number of V-cycles, maxcyc, equal to
1 or 2. The routine mgfas uses the same functions copy, interp, and rstrct as mglin.

#include "nrutil.h"
#define NPRE 1      /* Number of relaxation sweeps before ... */
#define NPOST 1     /* ... and after the coarse-grid correction is computed. */
#define ALPHA 0.33  /* Relates the estimated truncation error to the norm of the residual. */
#define NGMAX 15

void mgfas(double **u, int n, int maxcyc)
/* Full Multigrid Algorithm for FAS solution of nonlinear elliptic equation, here equation (19.6.44).
   On input u[1..n][1..n] contains the right-hand side rho, while on output it returns the solution.
   The dimension n must be of the form 2^j + 1 for some integer j. (j is actually the number of
   grid levels used in the solution, called ng below.) maxcyc is the maximum number of V-cycles
   to be used at each level. */
{
    double anorm2(double **a, int n);
    void copy(double **aout, double **ain, int n);
    void interp(double **uf, double **uc, int nf);
    void lop(double **out, double **u, int n);
    void matadd(double **a, double **b, double **c, int n);
    void matsub(double **a, double **b, double **c, int n);
    void relax2(double **u, double **rhs, int n);
    void rstrct(double **uc, double **uf, int nc);
    void slvsm2(double **u, double **rhs);
    unsigned int j,jcycle,jj,jm1,jpost,jpre,nf,ng=0,ngrid,nn;
    double **irho[NGMAX+1],**irhs[NGMAX+1],**itau[NGMAX+1],
        **itemp[NGMAX+1],**iu[NGMAX+1];
    double res,trerr;

    nn=n;
    while (nn >>= 1) ng++;
    if (n != 1+(1L << ng)) nrerror("n-1 must be a power of 2 in mgfas.");
    if (ng > NGMAX) nrerror("increase NGMAX in mgfas.");
    nn=n/2+1;
    ngrid=ng-1;
    irho[ngrid]=dmatrix(1,nn,1,nn);         /* Allocate storage for r.h.s. on grid ng-1, */
    rstrct(irho[ngrid],u,nn);               /* and fill it by restricting from the fine grid. */
    while (nn > 3) {                        /* Similarly allocate storage and fill r.h.s. on all */
        nn=nn/2+1;                          /* coarse grids. */
        irho[--ngrid]=dmatrix(1,nn,1,nn);
        rstrct(irho[ngrid],irho[ngrid+1],nn);
    }
    nn=3;
    iu[1]=dmatrix(1,nn,1,nn);
    irhs[1]=dmatrix(1,nn,1,nn);
    itau[1]=dmatrix(1,nn,1,nn);
    itemp[1]=dmatrix(1,nn,1,nn);
    slvsm2(iu[1],irho[1]);                  /* Initial solution on coarsest grid. */
    free_dmatrix(irho[1],1,nn,1,nn);
    ngrid=ng;
    for (j=2;j<=ngrid;j++) {                /* Nested iteration loop. */
        nn=2*nn-1;
        iu[j]=dmatrix(1,nn,1,nn);
        irhs[j]=dmatrix(1,nn,1,nn);
        itau[j]=dmatrix(1,nn,1,nn);
        itemp[j]=dmatrix(1,nn,1,nn);
        interp(iu[j],iu[j-1],nn);           /* Interpolate from coarse grid to next finer grid. */
        copy(irhs[j],(j != ngrid ? irho[j] : u),nn);    /* Set up r.h.s. */
        for (jcycle=1;jcycle<=maxcyc;jcycle++) {        /* V-cycle loop. */
            nf=nn;
            for (jj=j;jj>=2;jj--) {         /* Downward stroke of the V. */
                for (jpre=1;jpre<=NPRE;jpre++)          /* Pre-smoothing. */
                    relax2(iu[jj],irhs[jj],nf);
                lop(itemp[jj],iu[jj],nf);               /* L_h(u_h). */
                nf=nf/2+1;
                jm1=jj-1;
                rstrct(itemp[jm1],itemp[jj],nf);        /* R L_h(u_h). */
                rstrct(iu[jm1],iu[jj],nf);              /* R u_h. */
                lop(itau[jm1],iu[jm1],nf);              /* L_H(R u_h) stored temporarily in tau_h. */
                matsub(itau[jm1],itemp[jm1],itau[jm1],nf);      /* Form tau_h. */
                if (jj == j)
                    trerr=ALPHA*anorm2(itau[jm1],nf);   /* Estimate truncation error tau. */
                rstrct(irhs[jm1],irhs[jj],nf);          /* f_H. */
                matadd(irhs[jm1],itau[jm1],irhs[jm1],nf);       /* f_H + tau_h. */
            }
            slvsm2(iu[1],irhs[1]);          /* Bottom of V: Solve on coarsest grid. */
            nf=3;
            for (jj=2;jj<=j;jj++) {         /* Upward stroke of V. */
                jm1=jj-1;
                rstrct(itemp[jm1],iu[jj],nf);           /* R u_h. */
                matsub(iu[jm1],itemp[jm1],itemp[jm1],nf);       /* u_H - R u_h. */
                nf=2*nf-1;
                interp(itau[jj],itemp[jm1],nf);         /* P(u_H - R u_h) stored in tau_h. */
                matadd(iu[jj],itau[jj],iu[jj],nf);      /* Form u_h^new. */
                for (jpost=1;jpost<=NPOST;jpost++)      /* Post-smoothing. */
                    relax2(iu[jj],irhs[jj],nf);
            }
            lop(itemp[j],iu[j],nf);         /* Form residual ||d_h||. */
            matsub(itemp[j],irhs[j],itemp[j],nf);
            res=anorm2(itemp[j],nf);
            if (res < trerr) break;         /* No more V-cycles needed if residual small enough. */
        }
    }
    copy(u,iu[ngrid],n);                    /* Return solution in u. */
    for (nn=n,j=ng;j>=1;j--,nn=nn/2+1) {
        free_dmatrix(itemp[j],1,nn,1,nn);
        free_dmatrix(itau[j],1,nn,1,nn);
        free_dmatrix(irhs[j],1,nn,1,nn);
        free_dmatrix(iu[j],1,nn,1,nn);
        if (j != ng && j != 1) free_dmatrix(irho[j],1,nn,1,nn);
    }
}


void relax2(double **u, double **rhs, int n)
/* Red-black Gauss-Seidel relaxation for equation (19.6.44). The current value of the solution
   u[1..n][1..n] is updated, using the right-hand side function rhs[1..n][1..n]. */
{
    int i,ipass,isw,j,jsw=1;
    double foh2,h,h2i,res;

    h=1.0/(n-1);
    h2i=1.0/(h*h);
    foh2 = -4.0*h2i;
    for (ipass=1;ipass<=2;ipass++,jsw=3-jsw) {      /* Red and black sweeps. */
        isw=jsw;
        for (j=2;j<n;j++,isw=3-isw) {
            for (i=isw+1;i<n;i+=2) {
                res=h2i*(u[i+1][j]+u[i-1][j]+u[i][j+1]+u[i][j-1]-
                    4.0*u[i][j])+u[i][j]*u[i][j]-rhs[i][j];
                u[i][j] -= res/(foh2+2.0*u[i][j]);  /* Newton Gauss-Seidel formula. */
            }
        }
    }
}


#include <math.h>

void slvsm2(double **u, double **rhs)
/* Solution of equation (19.6.44) on the coarsest grid, where h = 1/2. The right-hand side is input
   in rhs[1..3][1..3] and the solution is returned in u[1..3][1..3]. */
{
    void fill0(double **u, int n);
    double disc,fact,h=0.5;

    fill0(u,3);
    fact=2.0/(h*h);
    disc=sqrt(fact*fact+rhs[2][2]);
    u[2][2] = -rhs[2][2]/(fact+disc);
}

void lop(double **out, double **u, int n)
/* Given u[1..n][1..n], returns L_h(u_h) for equation (19.6.44) in out[1..n][1..n]. */
{
    int i,j;
    double h,h2i;

    h=1.0/(n-1);
    h2i=1.0/(h*h);
    for (j=2;j<n;j++)                       /* Interior points. */
        for (i=2;i<n;i++)
            out[i][j]=h2i*(u[i+1][j]+u[i-1][j]+u[i][j+1]+u[i][j-1]-
                4.0*u[i][j])+u[i][j]*u[i][j];
    for (i=1;i<=n;i++)                      /* Boundary points. */
        out[i][1]=out[i][n]=out[1][i]=out[n][i]=0.0;
}

void matadd(double **a, double **b, double **c, int n)
/* Adds a[1..n][1..n] to b[1..n][1..n] and returns result in c[1..n][1..n]. */
{
    int i,j;

    for (j=1;j<=n;j++)
        for (i=1;i<=n;i++)
            c[i][j]=a[i][j]+b[i][j];
}

void matsub(double **a, double **b, double **c, int n)
/* Subtracts b[1..n][1..n] from a[1..n][1..n] and returns result in c[1..n][1..n]. */
{
    int i,j;

    for (j=1;j<=n;j++)
        for (i=1;i<=n;i++)
            c[i][j]=a[i][j]-b[i][j];
}


#include <math.h>

double anorm2(double **a, int n)
/* Returns the Euclidean norm of the matrix a[1..n][1..n]. */
{
    int i,j;
    double sum=0.0;

    for (j=1;j<=n;j++)
        for (i=1;i<=n;i++)
            sum += a[i][j]*a[i][j];
    return sqrt(sum)/n;
}


Chapter 20. Less-Numerical Algorithms

20.0 Introduction 


You can stop reading now. You are done with Numerical Recipes, as such. This 
final chapter is an idiosyncratic collection of “less-numerical recipes” which, for one
reason or another, we have decided to include between the covers of an otherwise
more-numerically oriented book. Authors of computer science texts, we’ve noticed,
like to throw in a token numerical subject (usually quite a dull one — quadrature, for 
example). We find that we are not free of the reverse tendency. 

Our selection of material is not completely arbitrary. One topic, Gray codes, was 
already used in the construction of quasi-random sequences (§7.7), and here needs 
only some additional explication. Two other topics, on diagnosing a computer’s 
floating-point parameters, and on arbitrary precision arithmetic, give additional 
insight into the machinery behind the casual assumption that computers are useful 
for doing things with numbers (as opposed to bits or characters). The latter of these 
topics also shows a very different use for Chapter 12’s fast Fourier transform. 

The three other topics (checksums, Huffman and arithmetic coding) involve 
different aspects of data coding, compression, and validation. If you handle a large 
amount of data — numerical data, even — then a passing familiarity with these 
subjects might at some point come in handy. In §13.6, for example, we already 
encountered a good use for Huffman coding. 

But again, you don’t have to read this chapter. (And you should learn about 
quadrature from Chapters 4 and 16, not from a computer science text!) 


20.1 Diagnosing Machine Parameters 

A convenient fiction is that a computer’s floating-point arithmetic is “accurate 
enough.” If you believe this fiction, then numerical analysis becomes a very clean 
subject. Roundoff error disappears from view; many finite algorithms become 
“exact”; only docile truncation error (§1.3) stands between you and a perfect 
calculation. Sounds rather naive, doesn’t it? 

Yes, it is naive. Notwithstanding, it is a fiction necessarily adopted throughout 
most of this book. To do a good job of answering the question of how roundoff error
propagates, or can be bounded, for every algorithm that we have discussed would be
impractical. In fact, it would not be possible: Rigorous analysis of many practical 
algorithms has never been made, by us or anyone. 

Proper numerical analysts cringe when they hear a user say, “I was getting 
roundoff errors with single precision, so I switched to double.” The actual meaning 
is, “for this particular algorithm, and my particular data, double precision seemed 
able to restore my erroneous belief in the ‘convenient fiction’.” We admit that most 
of the mentions of precision or roundoff in Numerical Recipes are only slightly more 
quantitative in character. That comes along with our trying to be “practical.” 

It is important to know what the limitations of your machine’s floating-point 
arithmetic actually are — the more so when your treatment of floating-point roundoff 
error is going to be intuitive, experimental, or casual. Methods for determining 
useful floating-point parameters experimentally have been developed by Cody [1],
Malcolm [2], and others, and are embodied in the routine machar, below, which 
follows Cody’s implementation. 

All of machar’s arguments are returned values. Here is what they mean: 

• ibeta (called B in §1.3) is the radix in which numbers are represented, 
almost always 2, but occasionally 16, or even 10. 

• it is the number of base-ibeta digits in the floating-point mantissa M 
(see Figure 1.3.1). 

• machep is the exponent of the smallest (most negative) power of ibeta 
that, added to 1.0, gives something different from 1.0. 

• eps is the floating-point number ibeta^machep, loosely referred to as the
“floating-point precision.”

• negep is the exponent of the smallest power of ibeta that, subtracted 
from 1.0, gives something different from 1.0. 

• epsneg is ibeta^negep, another way of defining floating-point precision.
Not infrequently epsneg is 0.5 times eps; occasionally eps and epsneg 
are equal. 

• iexp is the number of bits in the exponent (including its sign or bias). 

• minexp is the smallest (most negative) power of ibeta consistent with 
there being no leading zeros in the mantissa. 

• xmin is the floating-point number ibeta^minexp, generally the smallest
(in magnitude) useable floating value. 

• maxexp is the smallest (positive) power of ibeta that causes overflow. 

• xmax is (1 - epsneg) x ibeta^maxexp, generally the largest (in magnitude)
useable floating value. 

• irnd returns a code in the range 0 ... 5, giving information on what kind of 
rounding is done in addition, and on how underflow is handled. See below. 

• ngrd is the number of “guard digits” used when truncating the product of 
two mantissas to fit the representation. 
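On a compiler with a standard <float.h>, several of these quantities can be cross-checked against the library's own constants: ibeta corresponds to FLT_RADIX, it to FLT_MANT_DIG, eps to FLT_EPSILON, xmin to FLT_MIN, xmax to FLT_MAX, minexp to FLT_MIN_EXP - 1, and maxexp to FLT_MAX_EXP. The sketch below probes eps experimentally in the spirit of machar; the function name probe_eps is an assumption for this illustration, and the probe matches FLT_EPSILON only on machines that round to nearest as the IEEE standard specifies.

```c
#include <float.h>

/* Smallest power of the radix that, added to 1.0f, gives a result
   different from 1.0f (machar's eps). The volatile qualifiers force
   each intermediate result to be stored as a float, so extra register
   precision cannot hide the rounding. Hypothetical helper. */
float probe_eps(void)
{
    volatile float one = 1.0f, eps = 1.0f, sum;

    for (;;) {
        sum = one + eps / FLT_RADIX;    /* try the next smaller power */
        if (sum == one) break;          /* no longer distinguishable from 1 */
        eps /= FLT_RADIX;
    }
    return eps;
}
```

On a typical IEEE-compliant machine the probe returns 2^-23, about 1.19 x 10^-7, in single precision.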

There is a lot of subtlety in a program like machar, whose purpose is to ferret 
out machine properties that are supposed to be transparent to the user. Further, it must 
do so avoiding error conditions, like overflow and underflow, that might interrupt 
its execution. In some cases the program is able to do this only by recognizing 
certain characteristics of “standard” representations. For example, it recognizes 
the IEEE standard representation [3] by its rounding behavior, and assumes certain 
features of its exponent representation as a consequence. We refer you to [1] and
references therein for details. Be aware that machar can give incorrect results on
some nonstandard machines.

Sample Results Returned by machar

              typical IEEE-compliant machine      DEC VAX
precision     single            double            single
ibeta         2                 2                 2
it            24                53                24
machep        -23               -52               -24
eps           1.19 x 10^-7      2.22 x 10^-16     5.96 x 10^-8
negep         -24               -53               -24
epsneg        5.96 x 10^-8      1.11 x 10^-16     5.96 x 10^-8
iexp          8                 11                8
minexp        -126              -1022             -128
xmin          1.18 x 10^-38     2.23 x 10^-308    2.94 x 10^-39
maxexp        128               1024              127
xmax          3.40 x 10^38      1.79 x 10^308     1.70 x 10^38
irnd          5                 5                 1
ngrd          0                 0                 0

The parameter irnd needs some additional explanation. In the IEEE standard, 
bit patterns correspond to exact, “representable” numbers. The specified method 
for rounding an addition is to add two representable numbers “exactly,” and then 
round the sum to the closest representable number. If the sum is precisely halfway 
between two representable numbers, it should be rounded to the even one (low-order 
bit zero). The same behavior should hold for all the other arithmetic operations, 
that is, they should be done in a manner equivalent to infinite precision, and then 
rounded to the closest representable number. 
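The halfway rule can be observed directly in single precision. The following sketch assumes the default round-to-nearest mode is in effect and that float has the IEEE single format; the function name rounds_half_even is chosen for this illustration.

```c
#include <float.h>

/* Returns 1 if float addition rounds exact ties to the neighbour whose
   low-order bit is zero, 0 otherwise. volatile forces every result to
   be stored as a float before it is compared. Hypothetical helper. */
int rounds_half_even(void)
{
    volatile float one = 1.0f;
    volatile float ulp = FLT_EPSILON;        /* spacing of floats just above 1 */
    volatile float half = FLT_EPSILON / 2;   /* exactly half of that spacing */
    volatile float lo, hi;

    lo = one + half;            /* tie between 1 (even) and 1+ulp (odd) */
    hi = (one + ulp) + half;    /* tie between 1+ulp (odd) and 1+2ulp (even) */
    return lo == one && hi == one + 2*ulp;
}
```

The first sum falls back to 1 and the second moves up to 1 + 2 ulp: in both cases the exact result lies halfway between two representable numbers and the one with a zero low-order bit is chosen.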

If irnd returns 2 or 5, then your computer is compliant with this standard. If it 
returns 1 or 4, then it is doing some kind of rounding, but not the IEEE standard. If 
irnd returns 0 or 3, then it is truncating the result, not rounding it — not desirable. 

The other issue addressed by irnd concerns underflow. If a floating value is 
less than xmin, many computers underflow its value to zero. Values irnd = 0,1, 
or 2 indicate this behavior. The IEEE standard specifies a more graceful kind of 
underflow: As a value becomes smaller than xmin, its exponent is frozen at the 
smallest allowed value, while its mantissa is decreased, acquiring leading zeros and 
“gracefully” losing precision. This is indicated by irnd = 3, 4, or 5. 

#include <math.h>

#define CONV(i) ((float)(i))
/* Change float to double here and in declarations below to find double precision parameters. */

void machar(int *ibeta, int *it, int *irnd, int *ngrd, int *machep, int *negep,
    int *iexp, int *minexp, int *maxexp, float *eps, float *epsneg,
    float *xmin, float *xmax)
/* Determines and returns machine-specific parameters affecting floating-point arithmetic.
Returned values include ibeta, the floating-point radix; it, the number of base-ibeta digits in
the floating-point mantissa; eps, the smallest positive number that, added to 1.0, is not equal
to 1.0; epsneg, the smallest positive number that, subtracted from 1.0, is not equal to 1.0;
xmin, the smallest representable positive number; and xmax, the largest representable positive
number. See text for description of other returned parameters. */
{
    int i,itemp,iz,j,k,mx,nxres;
    float a,b,beta,betah,betain,one,t,temp,temp1,tempa,two,y,z,zero;

    one=CONV(1);
    two=one+one;
    zero=one-one;
    a=one;    /* Determine ibeta and beta by the method of M. Malcolm. */
    do {
        a += a;
        temp=a+one;
        temp1=temp-a;
    } while (temp1-one == zero);
    b=one;
    do {
        b += b;
        temp=a+b;
        itemp=(int)(temp-a);
    } while (itemp == 0);
    *ibeta=itemp;
    beta=CONV(*ibeta);
    *it=0;    /* Determine it and irnd. */
    b=one;
    do {
        ++(*it);
        b *= beta;
        temp=b+one;
        temp1=temp-b;
    } while (temp1-one == zero);
    *irnd=0;
    betah=beta/two;
    temp=a+betah;
    if (temp-a != zero) *irnd=1;
    tempa=a+beta;
    temp=tempa+betah;
    if (*irnd == 0 && temp-tempa != zero) *irnd=2;
    *negep=(*it)+3;    /* Determine negep and epsneg. */
    betain=one/beta;
    a=one;
    for (i=1;i<=(*negep);i++) a *= betain;
    b=a;    /* Save this starting value; it is reused below in the search for eps. */
    for (;;) {
        temp=one-a;
        if (temp-one != zero) break;
        a *= beta;
        --(*negep);
    }
    *negep = -(*negep);
    *epsneg=a;
    *machep = -(*it)-3;    /* Determine machep and eps. */
    a=b;
    for (;;) {
        temp=one+a;
        if (temp-one != zero) break;
        a *= beta;
        ++(*machep);
    }
    *eps=a;
    *ngrd=0;    /* Determine ngrd. */
    temp=one+(*eps);
    if (*irnd == 0 && temp*one-one != zero) *ngrd=1;
    i=0;    /* Determine iexp. */
    k=1;
    z=betain;
    t=one+(*eps);
    nxres=0;
    for (;;) {    /* Loop until an underflow occurs, then exit. */
        y=z;
        z=y*y;
        a=z*one;    /* Check here for the underflow. */
        temp=z*t;
        if (a+a == zero || fabs(z) >= y) break;
        temp1=temp*betain;
        if (temp1*beta == z) break;
        ++i;
        k += k;
    }
    if (*ibeta != 10) {
        *iexp=i+1;
        mx=k+k;
    } else {    /* For decimal machines only. */
        *iexp=2;
        iz=(*ibeta);
        while (k >= iz) {
            iz *= *ibeta;
            ++(*iexp);
        }
        mx=iz+iz-1;
    }
    for (;;) {    /* To determine minexp and xmin, loop until an underflow occurs, then exit. */
        *xmin=y;
        y *= betain;
        a=y*one;    /* Check here for the underflow. */
        temp=y*t;
        if (a+a != zero && fabs(y) < *xmin) {
            ++k;
            temp1=temp*betain;
            if (temp1*beta == y && temp != y) {
                nxres=3;
                *xmin=y;
                break;
            }
        }
        else break;
    }
    *minexp = -k;    /* Determine maxexp, xmax. */
    if (mx <= k+k-3 && *ibeta != 10) {
        mx += mx;
        ++(*iexp);
    }
    *maxexp=mx+(*minexp);
    *irnd += nxres;    /* Adjust irnd to reflect partial underflow. */
    if (*irnd >= 2) *maxexp -= 2;    /* Adjust for IEEE-style machines. */
    i=(*maxexp)+(*minexp);

    /* Adjust for machines with implicit leading bit in binary mantissa, and machines with radix
    point at extreme right of mantissa. */
    if (*ibeta == 2 && !i) --(*maxexp);
    if (i > 20) --(*maxexp);
    if (a != y) *maxexp -= 2;
    *xmax=one-(*epsneg);
    if ((*xmax)*one != *xmax) *xmax=one-beta*(*epsneg);
    *xmax /= (*xmin*beta*beta*beta);
    i=(*maxexp)+(*minexp)+3;
    for (j=1;j<=i;j++) {
        if (*ibeta == 2) *xmax += *xmax;
        else *xmax *= beta;
    }
}

Some typical values returned by machar are given in the table, above. IEEE- 
compliant machines referred to in the table include most UNIX workstations (SUN, 
DEC, MIPS), and Apple Macintosh IIs. IBM PCs with floating co-processors 
are generally IEEE-compliant, except that some compilers underflow intermediate 
results ungracefully, yielding irnd = 2 rather than 5. Notice, as in the case of a VAX 
(fourth column), that representations with a “phantom” leading 1 bit in the mantissa 
achieve a smaller eps for the same wordlength, but cannot underflow gracefully. 
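
The two properties that irnd encodes, rounding of halfway cases and the style of underflow, can also be probed directly. The following is a minimal sketch and not part of machar: the function names are hypothetical, the rounding probe assumes a radix-2 float with a 24-bit mantissa (as in IEEE single precision), and volatile stores are used on the assumption that they force each result to be rounded to float.

```c
#include <assert.h>
#include <float.h>

/* Returns 1 if float addition rounds exact halfway cases to the even neighbor.
   Assumes a radix-2 float with a 24-bit mantissa. */
int rounds_half_to_even(void)
{
    volatile float a = 16777216.0f;    /* 2^24; neighboring floats are 2 apart here */
    volatile float b = 16777218.0f;    /* 2^24 + 2, the next float above a */
    volatile float one = 1.0f;
    volatile float s1, s2;

    assert(FLT_RADIX == 2 && FLT_MANT_DIG == 24);
    s1 = a + one;    /* exact sum 2^24 + 1 is halfway; the even neighbor is 2^24 */
    s2 = b + one;    /* exact sum 2^24 + 3 is halfway; the even neighbor is 2^24 + 4 */
    return s1 == 16777216.0f && s2 == 16777220.0f;
}

/* Returns 1 if a value below the smallest normalized float survives as a
   nonzero number (graceful underflow), 0 if it is flushed to zero. */
int underflows_gracefully(void)
{
    volatile float x = FLT_MIN;
    x = x / 2.0f;
    return x > 0.0f;
}

/* Smallest power of 2 that, added to 1.0, gives a result different from 1.0. */
float power_of_two_eps(void)
{
    volatile float eps = 1.0f, sum;
    for (;;) {
        sum = 1.0f + eps / 2.0f;
        if (sum == 1.0f) break;
        eps = eps / 2.0f;
    }
    return eps;
}
```

On a machine for which machar reports irnd = 5, one would expect both probes to return 1, and power_of_two_eps to agree with the eps that machar returns for a binary machine. Compiler options that flush denormals to zero or relax floating-point semantics can change these answers.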

CITED REFERENCES AND FURTHER READING: 

Goldberg, D. 1991, ACM Computing Surveys, vol. 23, pp. 5-48. 

Cody, W.J. 1988, ACM Transactions on Mathematical Software, vol. 14, pp. 303-311. [1] 
Malcolm, M.A. 1972, Communications of the ACM, vol. 15, pp. 949-951. [2] 

IEEE Standard for Binary Floating-Point Numbers, ANSI/IEEE Std 754-1985 (New York: IEEE, 
1985). [3] 


20.2 Gray Codes 

A Gray code is a function G(i) of the integers i, that for each integer $N \ge 0$ 
is one-to-one for $0 \le i \le 2^N - 1$, and that has the following remarkable property: 
The binary representations of G(i) and G(i + 1) differ in exactly one bit. An example 
of a Gray code (in fact, the most commonly used one) is the sequence 0000, 0001, 
0011, 0010, 0110, 0111, 0101, 0100, 1100, 1101, 1111, 1110, 1010, 1011, 1001, 
and 1000, for $i = 0, \ldots, 15$. The algorithm for generating this code is simply to 
form the bitwise exclusive-or (XOR) of i with i/2 (integer part). Think about how 
the carries work when you add one to a number in binary, and you will be able to see 
why this works. You will also see that G(i) and G(i + 1) differ in the bit position of 
the rightmost zero bit of i (prefixing a leading zero if necessary). 
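
The two claims just made, that consecutive codes differ in exactly one bit and that this bit sits at the rightmost zero of i, are easy to check by brute force. The following is a small sketch for that purpose; the function names are ours and are not part of the routines in this book.

```c
#include <assert.h>

/* Gray code of i: the XOR of i with i/2 (integer part). */
unsigned long gray(unsigned long i)
{
    return i ^ (i >> 1);
}

/* Number of one-bits in x. */
int count_ones(unsigned long x)
{
    int count = 0;
    while (x) {
        x &= x - 1;    /* clear the lowest set bit */
        count++;
    }
    return count;
}

/* Mask with a single 1 at the position of the rightmost zero bit of i. */
unsigned long rightmost_zero(unsigned long i)
{
    return ~i & (i + 1);
}

/* Returns 1 if, for every i with 0 <= i < 2^nbits - 1, G(i) and G(i+1) differ
   in exactly one bit, located at the rightmost zero bit of i. */
int gray_step_ok(int nbits)
{
    unsigned long i, last;

    assert(nbits > 0 && nbits < 32);
    last = (1UL << nbits) - 1;
    for (i = 0; i < last; i++) {
        unsigned long diff = gray(i) ^ gray(i + 1);
        if (count_ones(diff) != 1 || diff != rightmost_zero(i)) return 0;
    }
    return 1;
}
```

For example, gray(7) is 4 (binary 0100) and gray(8) is 12 (binary 1100), matching the eighth and ninth entries of the sequence listed above.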

The spelling is “Gray,” not “gray”: The codes are named after one Frank Gray, 
who first patented the idea for use in shaft encoders. A shaft encoder is a wheel with 
concentric coded stripes each of which is “read” by a fixed conducting brush. The 
idea is to generate a binary code describing the angle of the wheel. The obvious, 
but wrong, way to build a shaft encoder is to have one stripe (the innermost, say) 
conducting on half the wheel, but insulating on the other half; the next stripe is 
conducting in quadrants 1 and 3; the next stripe is conducting in octants 1, 3, 5, 
and 7; and so on. The brushes together then read a direct binary code for the 
position of the wheel. 

Figure 20.2.1. Single-bit operations for calculating the Gray code G(i) from i (a), or the inverse (b). 
LSB and MSB indicate the least and most significant bits, respectively. XOR denotes exclusive-or. 

The reason this method is bad, is that there is no way to guarantee that all the 
brushes will make or break contact exactly simultaneously as the wheel turns. Going 
from position 7 (0111) to 8 (1000), one might pass spuriously and transiently through 
6 (0110), 14 (1110), and 10 (1010), as the different brushes make or break contact. 
Use of a Gray code on the encoding stripes guarantees that there is no transient state 
between 7 (0100 in the sequence above) and 8 (1100). 

Of course we then need circuitry, or algorithmics, to translate from G(i) to i. 
Figure 20.2.1(b) shows how this is done by a cascade of XOR gates. The idea is 
that each output bit should be the XOR of the corresponding input bit with all more 
significant input bits. To do N bits of Gray code inversion requires $N - 1$ steps (or 
gate delays) in the circuit. (Nevertheless, this is typically very fast in circuitry.) In a 
register with word-wide binary operations, we don’t have to do N consecutive 
operations, but only $\log_2 N$. The trick is to use the associativity of XOR and group 
the operations hierarchically. This involves sequential right-shifts by 1, 2, 4, 8, ... bits 
until the wordlength is exhausted. Here is a piece of code for doing both G(i) and its inverse. 


unsigned long igray(unsigned long n, int is)
/* For zero or positive values of is, return the Gray code of n; if is is negative, return the inverse
Gray code of n. */
{
    int ish;
    unsigned long ans,idiv;

    if (is >= 0)    /* This is the easy direction! */
        return n ^ (n >> 1);
    ish=1;    /* This is the more complicated direction: In hierarchical
                 stages, starting with a one-bit right shift, cause each
                 bit to be XORed with all more significant bits. */
    ans=n;
    for (;;) {
        ans ^= (idiv=ans >> ish);
        /* The test ish == 16 assumes that n occupies at most 32 bits. */
        if (idiv <= 1 || ish == 16) return ans;
        ish <<= 1;    /* Double the amount of shift on the next cycle. */
    }
}


In numerical work, Gray codes can be useful when you need to do some task 
that depends intimately on the bits of i, looping over many values of i. Then, if there 
are economies in repeating the task for values differing by only one bit, it makes 
sense to do things in Gray code order rather than consecutive order. We saw an 
example of this in §7.7, for the generation of quasi-random sequences. 
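
To make the economy concrete, here is a small sketch of a loop in Gray code order. It is our own illustration with a hypothetical function name, not the quasi-random generator of §7.7. The task is to accumulate a quantity over all subsets of n weights. Because consecutive Gray codes differ in one bit, each step updates the running subset total with a single addition or subtraction instead of re-summing the subset.

```c
#include <assert.h>

/* Sum, over all 2^n subsets of w[0..n-1], of the square of the subset's total
   weight. Subsets are visited in Gray code order, so each step adds or removes
   exactly one element from the running total. */
double sum_sq_subset_totals(const double w[], int n)
{
    unsigned long i, nsub, g = 0;    /* g is the Gray code of the previous subset */
    double total = 0.0, acc = 0.0;   /* the empty subset contributes zero */
    int bit;

    assert(n >= 0 && n < 32);
    nsub = 1UL << n;
    for (i = 1; i < nsub; i++) {
        unsigned long gnew = i ^ (i >> 1);
        unsigned long diff = g ^ gnew;    /* exactly one bit is set */
        for (bit = 0; !((diff >> bit) & 1UL); bit++)
            ;                             /* locate the changed bit */
        if (gnew & diff) total += w[bit]; /* element entered the subset */
        else total -= w[bit];             /* element left the subset */
        acc += total * total;
        g = gnew;
    }
    return acc;
}
```

With floating-point weights the repeated additions and subtractions accumulate roundoff, so in long loops one may want to recompute the running total from scratch occasionally.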


CITED REFERENCES AND FURTHER READING: 

Horowitz, P., and Hill, W. 1989, The Art of Electronics, 2nd ed. (New York: Cambridge University 
Press), §8.02. 

Knuth, D.E. Combinatorial Algorithms, vol. 4 of The Art of Computer Programming (Reading, 
MA: Addison-Wesley), §7.2.1. [Unpublished. Will it be always so?] 


20.3 Cyclic Redundancy and Other Checksums 

When you send a sequence of bits from point A to point B, you want to know 
that it will arrive without error. A common form of insurance is the “parity bit,” 
attached to 7-bit ASCII characters to put them into 8-bit format. The parity bit is 
chosen so as to make the total number of one-bits (versus zero-bits) either always 
even (“even parity”) or always odd (“odd parity”). Any single bit error in a character 
will thereby be detected. When errors are sufficiently rare, and do not occur closely 
bunched in time, use of parity provides sufficient error detection. 
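
As a concrete illustration of the parity bit, here is a minimal sketch with hypothetical function names. It assumes the parity bit is placed in the most significant position of the 8-bit character.

```c
#include <assert.h>

/* Attach a parity bit (in the high-order position) to the 7-bit code c.
   odd = 0 gives even parity, odd = 1 gives odd parity. */
unsigned char add_parity(unsigned char c, int odd)
{
    unsigned char x = c & 0x7F;
    int ones = 0, b;

    assert(odd == 0 || odd == 1);
    for (b = 0; b < 7; b++) ones += (x >> b) & 1;
    if ((ones & 1) != odd) x |= 0x80;    /* set the parity bit to fix the count */
    return x;
}

/* Returns 1 if the 8-bit character ch has the requested parity, else 0. */
int parity_ok(unsigned char ch, int odd)
{
    int ones = 0, b;

    assert(odd == 0 || odd == 1);
    for (b = 0; b < 8; b++) ones += (ch >> b) & 1;
    return (ones & 1) == odd;
}
```

Flipping any single bit of a character produced by add_parity makes parity_ok fail, while flipping two bits leaves the parity unchanged and so goes undetected.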

Unfortunately, in real situations, a single noise “event” is likely to disrupt more 
than one bit. Since the parity bit has two possible values (0 and 1), it gives, on 
average, only a 50% chance of detecting an erroneous character with more than one 
wrong bit. That probability, 50%, is not nearly good enough for most applications. 
Most communications protocols [1] use a multibit generalization of the parity bit 
called a “cyclic redundancy check” or CRC. In typical applications the CRC is 16 
bits long (two bytes or two characters), so that the chance of a random error going 
undetected is 1 in $2^{16} = 65536$. Moreover, M-bit CRCs have the mathematical 
property of detecting all errors that occur in M or fewer consecutive bits, for any 
length of message. (We prove this below.) Since noise in communication channels 
tends to be “bursty,” with short sequences of adjacent bits getting corrupted, this 
consecutive-bit property is highly desirable. 

Normally CRCs lie in the province of communications software experts and 
chip-level hardware designers — people with bits under their fingernails. However, 
there are at least two kinds of situations where some understanding of CRCs can be 
useful to the rest of us. First, we sometimes need to be able to communicate with 
a lower-level piece of hardware or software that expects a valid CRC as part of its 
input. For example, it can be convenient to have a program generate XMODEM 
or Kermit [2] packets directly into the communications line rather than having to 
store the data in a local file. 

Second, in the manipulation of large quantities of (e.g., experimental) data, it 
is useful to be able to tag aggregates of data (whether numbers, records, lines, or 
whole files) with a statistically unique “key,” its CRC. Aggregates of any size can 
then be compared for identity by comparing only their short CRC keys. Differing 
keys imply nonidentical records. Identical keys imply, to high statistical certainty, 
identical records. If you can’t tolerate the very small probability of being wrong, you 
can do a full comparison of the records when the keys are identical. When there is a 
possibility of files or data records being inadvertently or irresponsibly modified (for 
example, by a computer virus), it is useful to have their prior CRCs stored externally 
on a physically secure medium, like a floppy disk. 

Sometimes CRCs can be used to compress data as it is recorded. If identical data 
records occur frequently, one can keep sorted in memory the CRCs of previously 
encountered records. A new record is archived in full if its CRC is different, 
otherwise only a pointer to a previous record need be archived. In this application 
one might desire a 4- or 8-byte CRC, to make the odds of mistakenly discarding 
a different data record be tolerably small; or, if previous records can be randomly 
accessed, a full comparison can be made to decide whether records with identical 
CRCs are in fact identical. 

Now let us briefly discuss the theory of CRCs. After that, we will give 
implementations of various (related) CRCs that are used by the official or de facto 
standard protocols [1-3] listed in the accompanying table. 

The mathematics underlying CRCs is “polynomials over the integers modulo 
2.” Any binary message can be thought of as a polynomial with coefficients 0 and 1. 
For example, the message “1100001101” is the polynomial $x^9 + x^8 + x^3 + x^2 + 1$. 
Since 0 and 1 are the only integers modulo 2, a power of x in the polynomial is 
either present (1) or absent (0). A polynomial over the integers modulo 2 may be 
irreducible, meaning that it can’t be factored. A subset of the irreducible polynomials 
are the “primitive” polynomials. These generate maximum length sequences when 
used in shift registers, as described in §7.4. The polynomial $x^2 + 1$ is not irreducible: 
$x^2 + 1 = (x + 1)(x + 1)$, so it is also not primitive. The polynomial $x^4 + x^3 + x^2 + x + 1$ 
is irreducible, but it turns out not to be primitive. The polynomial $x^4 + x + 1$ is 
both irreducible and primitive. 
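
These statements can be checked mechanically. In the sketch below (our own illustration, with hypothetical function names) a polynomial modulo 2 is stored as a bit mask, bit k holding the coefficient of $x^k$. We rely on the standard fact that a polynomial of degree M with nonzero constant term is primitive exactly when the smallest k with $x^k = 1$ modulo the polynomial is $2^M - 1$.

```c
#include <assert.h>

/* Product of two polynomials over the integers modulo 2 (a carry-less multiply). */
unsigned long gf2_mul(unsigned long a, unsigned long b)
{
    unsigned long p = 0;
    while (b) {
        if (b & 1UL) p ^= a;    /* addition modulo 2 is XOR */
        a <<= 1;
        b >>= 1;
    }
    return p;
}

/* Smallest k > 0 such that x^k = 1 modulo poly, where poly has degree deg
   and a nonzero constant term. */
unsigned long gf2_order_of_x(unsigned long poly, int deg)
{
    unsigned long r = 1UL, k = 0, top;

    assert(deg >= 1 && deg < 31);
    assert((poly >> deg) == 1UL && (poly & 1UL));
    top = 1UL << deg;
    do {
        r <<= 1;                   /* multiply by x */
        if (r & top) r ^= poly;    /* reduce modulo poly */
        k++;
    } while (r != 1UL);
    return k;
}
```

Here gf2_mul(3, 3) returns 5, that is, $(x + 1)(x + 1) = x^2 + 1$. The order of x is 15 for $x^4 + x + 1$ (mask 0x13), so that polynomial is primitive, but only 5 for $x^4 + x^3 + x^2 + x + 1$ (mask 0x1F), which is why that polynomial is not.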

An M-bit long CRC is based on a primitive polynomial of degree M, called 
the generator polynomial. Alternatively, the generator is chosen to be a primitive 
polynomial times $(1 + x)$ (this finds all parity errors). For 16-bit CRCs, the CCITT 
(Comité Consultatif International Télégraphique et Téléphonique) has anointed the 
“CCITT polynomial,” which is $x^{16} + x^{12} + x^5 + 1$. This polynomial is used by all of 
the protocols listed in the table. Another common choice is the “CRC-16” polynomial 
$x^{16} + x^{15} + x^2 + 1$, which is used for EBCDIC messages in IBM’s BISYNCH [1]. 
A common 12-bit choice, “CRC-12,” is $x^{12} + x^{11} + x^3 + x + 1$. A common 32-bit 
choice, “AUTODIN-II,” is $x^{32} + x^{26} + x^{23} + x^{22} + x^{16} + x^{12} + x^{11} + x^{10} + x^8 + 
x^7 + x^5 + x^4 + x^2 + x + 1$. For a table of some other primitive polynomials, see §7.4. 

Conventions and Test Values for Various CRC Protocols

             icrc args      Test Values ($C_2 C_1$ in hex)   Packet
Protocol     jinit   jrev   T      CatMouse987654321         Format                                     CRC

XMODEM       0       1      1A71   E556                      $S_1 S_2 \ldots S_N C_2 C_1$               0
X.25         255     -1     1B26   F56E                      $S_1 S_2 \ldots S_N \bar{C}_1 \bar{C}_2$   F0B8
(no name)    255     -1     1B26   F56E                      $S_1 S_2 \ldots S_N C_1 C_2$               0
SDLC (IBM)   same as X.25
HDLC (ISO)   same as X.25
CRC-CCITT    0       -1     14A1   C28D                      $S_1 S_2 \ldots S_N C_1 C_2$               0
(no name)    0       -1     14A1   C28D                      $S_1 S_2 \ldots S_N \bar{C}_1 \bar{C}_2$   F0B8
Kermit       same as CRC-CCITT                               see Notes

Notes: Overbar denotes bit complement. $S_1 \ldots S_N$ are character data. $C_1$ is CRC’s least 
significant 8 bits, $C_2$ is its most significant 8 bits, so CRC $= 256\,C_2 + C_1$ (shown 
in hex). Kermit (block check level 3) sends the CRC as 3 printable ASCII characters 
(sends value +32). These contain, respectively, 4 most significant bits, 6 middle bits, 
6 least significant bits. 

Given the generator polynomial G of degree M (which can be written either in 
polynomial form or as a bit-string, e.g., 10001000000100001 for CCITT), here is 
how you compute the CRC for a sequence of bits S: First, multiply S by $x^M$, that is, 
append M zero bits to it. Second, divide G into $S x^M$ by long division. Keep 
in mind that the subtractions in the long division are done modulo 2, so that there 
are never any “borrows”: Modulo 2 subtraction is the same as logical exclusive-or 
(XOR). Third, ignore the quotient you get. Fourth, when you eventually get to a 
remainder, it is the CRC, call it C. C will be a polynomial of degree $M - 1$ or less, 
otherwise you would not have finished the long division. Therefore, in bit string 
form, it has M bits, which may include leading zeros. (C might even be all zeros, 
see below.) See [3] for a worked example. 

If you work through the above steps in an example, you will see that most of 
what you write down in the long-division tableau is superfluous. You are actually just 
left-shifting sequential bits of S, from the right, into an M-bit register. Every time a 1 
bit gets shifted off the left end of this register, you zap the register by an XOR with the 
M low order bits of G (that is, all the bits of G except its leading 1). When a 0 bit is 
shifted off the left end you don’t zap the register. When the last bit that was originally 
part of S gets shifted off the left end of the register, what remains is the CRC. 
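
The register procedure just described can be written down almost word for word. The following is a deliberately slow, bit-at-a-time sketch for the CCITT generator with M = 16. The function name is ours, and we assume that the most significant bit of each byte is sent first and that the register starts at zero (the XMODEM convention in the table).

```c
#include <assert.h>
#include <stddef.h>

/* CRC of the bytes s[0..len-1] for the generator x^16 + x^12 + x^5 + 1,
   computed by shifting the message bits, followed by 16 zero bits, through
   a remainder register. */
unsigned int crc16_bitwise(const unsigned char *s, size_t len)
{
    const unsigned long G = 0x11021UL;    /* bit string 1 0001 0000 0010 0001 */
    unsigned long reg = 0;                /* bit 16 plays the role of the left end */
    size_t nbits = 8 * len + 16;          /* message bits plus M appended zeros */
    size_t k;

    assert(s != NULL || len == 0);
    for (k = 0; k < nbits; k++) {
        unsigned long bit = 0;
        if (k < 8 * len) bit = (s[k / 8] >> (7 - k % 8)) & 1UL;
        reg = (reg << 1) | bit;           /* shift the next bit in from the right */
        if (reg & 0x10000UL) reg ^= G;    /* a 1 came off the left end: zap with G */
    }
    return (unsigned int) reg;
}
```

For the one-character message "T" this returns hexadecimal 1A71, the XMODEM test value in the table.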

You can immediately recognize how efficiently this procedure can be implemented 
in hardware. It requires only a shift register with a few hard-wired XOR 
taps into it. That is how CRCs are computed in communications devices, by a single 
chip (or small part of one). In software, the implementation is not so elegant, since 
bit-shifting is not generally very efficient. One therefore typically finds (as in our 
implementation below) table-driven routines that pre-calculate the result of a bunch 
of shifts and XORs, say for each of 256 possible 8-bit inputs [4]. 

We can now see how the CRC gets its ability to detect all errors in M consecutive 
bits. Suppose two messages, S and T, differ only within a frame of M bits. Then 
their CRCs differ by an amount that is the remainder when G is divided into 
$(S - T)x^M = D$. Now D has the form of leading zeros (which can be ignored), 
followed by some 1’s in an M-bit frame, followed by trailing zeros (which are just 
multiplicative factors of x): $D = x^n F$, where F is a nonzero polynomial of degree at 
most $M - 1$ and $n > 0$. Since G is always primitive or primitive times $(1 + x)$, it is not 
divisible by x, so G has no factor in common with $x^n$ and could divide D only by 
dividing F. That is impossible, because F is nonzero and has lower degree than G. 
So G cannot divide D. Therefore S and T must have different CRCs. 

In most protocols, a transmitted block of data consists of some N data bits, 
directly followed by the M bits of their CRC (or the CRC XORed with a constant, 
see below). There are two equivalent ways of validating a block at the receiving end. 
Most obviously, the receiver can compute the CRC of the data bits, and compare it to 
the transmitted CRC bits. Less obviously, but more elegantly, the receiver can simply 
compute the CRC of the total block, with N + M bits, and verify that a result of zero 
is obtained. Proof: The total block is the polynomial $S x^M + C$ (data left-shifted to 
make room for the CRC bits). The definition of C is that $S x^M = QG + C$, where 
Q is the discarded quotient. But then $S x^M + C = QG + C + C = QG$ (remember 
modulo 2), which is a perfect multiple of G. It remains a multiple of G when it gets 
multiplied by an additional $x^M$ on the receiving end, so it has a zero CRC, q.e.d. 
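
The zero-remainder check is easy to try out. The sketch below uses hypothetical function names and assumes the simplest convention: CCITT generator, zero initial register, most significant bit first, CRC appended high byte first. It builds a block and validates it by computing the CRC of the whole block.

```c
#include <assert.h>
#include <stddef.h>

/* CRC of s[0..len-1] for the generator x^16 + x^12 + x^5 + 1, with each
   character XORed into the high byte of the register before 8 shifts. */
unsigned int crc16_ccitt(const unsigned char *s, size_t len)
{
    unsigned int reg = 0;
    size_t j;
    int i;

    assert(s != NULL || len == 0);
    for (j = 0; j < len; j++) {
        reg ^= (unsigned int) s[j] << 8;
        for (i = 0; i < 8; i++) {
            if (reg & 0x8000U) reg = ((reg << 1) ^ 0x1021U) & 0xFFFFU;
            else reg = (reg << 1) & 0xFFFFU;
        }
    }
    return reg;
}

/* Copy data[0..n-1] into block and append the two CRC bytes, high byte first.
   The caller must supply room for n + 2 bytes in block. */
void make_block(const unsigned char *data, size_t n, unsigned char *block)
{
    unsigned int c = crc16_ccitt(data, n);
    size_t j;

    for (j = 0; j < n; j++) block[j] = data[j];
    block[n] = (unsigned char) (c >> 8);
    block[n + 1] = (unsigned char) (c & 0xFF);
}

/* A received block is accepted if the CRC of all of its bytes is zero. */
int block_is_valid(const unsigned char *block, size_t nblock)
{
    return crc16_ccitt(block, nblock) == 0;
}
```

A block built by make_block passes block_is_valid. Corrupting any run of bits that spans 16 or fewer positions makes it fail, as the consecutive-bit property proved above requires.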

A couple of small variations on the basic procedure need to be mentioned [1,3]: 
First, when the CRC is computed, the M-bit register need not be initialized to zero. 
Initializing it to some other M-bit value (e.g., all 1’s) in effect prefaces all blocks by 
a phantom message that would have given the initialization value as its remainder. 
It is advantageous to do this, since the CRC described thus far otherwise cannot 
detect the addition or removal of any number of initial zero bits. (Loss of an initial 
bit, or insertion of zero bits, are common “clocking errors.”) Second, one can add 
(XOR) any M-bit constant K to the CRC before it is transmitted. This constant 
can either be XORed away at the receiving end, or else it just changes the expected 
CRC of the whole block by a known amount, namely the remainder of dividing G 
into $K x^M$. The constant K is frequently “all bits,” changing the CRC into its ones 
complement. This has the advantage of detecting another kind of error that the CRC 
would otherwise not find: deletion of an initial 1 bit in the message with spurious 
insertion of a 1 bit at the end of the block. 

The accompanying function icrc implements the above CRC calculation, 
including the possibility of the mentioned variations. Input to the function is a 
pointer to an array of characters, and the length of that array. icrc has two “switch” 
arguments that specify variations in the CRC calculation. A zero or positive value 
of jinit causes the 16-bit register to have each byte initialized with the value 
jinit. A negative value of jrev causes each input character to be interpreted as 
its bit-reverse image, and a similar bit reversal to be done on the output CRC. You 
do not have to understand this; just use the values of jinit and jrev specified in 
the table. (If you insist on knowing, the explanation is that serial data ports send 
characters least-significant bit first (!), and many protocols shift bits into the CRC 
register in exactly the order received.) The table shows how to construct a block 
of characters from the input array and output CRC of icrc. You should not need 
to do any additional bit-reversal outside of icrc. 

The switch jinit has one additional use: When negative it causes the input 
value of the argument crc to be used as initialization of the register. If you set crc to the 
result of the last call to icrc, this in effect appends the current input array to that of 
the previous call or calls. Use this feature, for example, to build up the CRC of a 
whole file a line at a time, without keeping the whole file in memory. 

The routine icrc is loosely based on the function in [4]. Here is how to 
understand its operation: First look at the function icrc1. This incorporates one 
input character into a 16-bit CRC register. The only trick used is that character bits 
are XORed into the most significant bits, eight at a time, instead of being fed into 
the least significant bit, one bit at a time, at the time of the register shift. This works 
because XOR is associative and commutative: we can feed in character bits any 
time before they will determine whether to zap with the generator polynomial. (The 
decimal constant 4129, hexadecimal 0x1021, has the generator’s bits in it.) 

unsigned short icrc1(unsigned short crc, unsigned char onech)
/* Given a remainder up to now, return the new CRC after one character is added. This routine
is functionally equivalent to icrc(,,1,-1,1), but slower. It is used by icrc to initialize its
table. */
{
    int i;
    unsigned short ans=(crc ^ onech << 8);

    for (i=0;i<8;i++) {    /* Here is where 8 one-bit shifts, and some XORs with the
                              generator polynomial, are done. */
        if (ans & 0x8000)
            ans = (unsigned short)((ans << 1) ^ 4129);    /* shift, then zap; ans is modified only once */
        else
            ans <<= 1;
    }
    return ans;
}


Now look at icrc. There are two parts to understand, how it builds a table 
when it initializes, and how it uses that table later on. Go back to thinking about a 
character’s bits being shifted into the CRC register from the least significant end. The 
key observation is that while 8 bits are being shifted into the register’s low end, all 
the generator zapping is being determined by the bits already in the high end. Since 
XOR is commutative and associative, all we need is a table of the result of all this 
zapping, for each of 256 possible high-bit configurations. Then we can play catch-up 
and XOR an input character into the result of a lookup into this table. The only other 
content to icrc is the construction at initialization time of an 8-bit bit-reverse table 
from the 4-bit table stored in it, and the logic associated with doing the bit reversals. 
References [4-6] give further details on table-driven CRC computations. 

typedef unsigned char uchar;
#define LOBYTE(x) ((uchar)((x) & 0xFF))
#define HIBYTE(x) ((uchar)((x) >> 8))

unsigned short icrc(unsigned short crc, unsigned char *bufptr,
    unsigned long len, short jinit, int jrev)
/* Computes a 16-bit Cyclic Redundancy Check for a byte array bufptr[1..len], using any
of several conventions as determined by the settings of jinit and jrev (see accompanying
table). If jinit is negative, then crc is used on input to initialize the remainder register, in
effect (for crc set to the last returned value) concatenating bufptr to the previous call. */
{
    unsigned short icrc1(unsigned short crc, unsigned char onech);
    static unsigned short icrctb[256],init=0;
    static uchar rchr[256];
    unsigned long j;    /* As wide as len, so that the main loop terminates for any len. */
    unsigned short cword=crc;
    static uchar it[16]={0,8,4,12,2,10,6,14,1,9,5,13,3,11,7,15};
        /* Table of 4-bit bit-reverses. */

    if (!init) {    /* Do we need to initialize tables? */
        init=1;
        for (j=0;j<=255;j++) {
            /* The two tables are: CRCs of all characters, and bit-reverses of all characters. */
            icrctb[j]=icrc1((unsigned short)(j << 8),(uchar)0);
            rchr[j]=(uchar)(it[j & 0xF] << 4 | it[j >> 4]);
        }
    }
    if (jinit >= 0) cword=((uchar) jinit) | (((uchar) jinit) << 8);
        /* Initialize the remainder register. */
    else if (jrev < 0) cword=rchr[HIBYTE(cword)] | rchr[LOBYTE(cword)] << 8;
        /* If not initializing, do we reverse the register? */
    for (j=1;j<=len;j++)    /* Main loop over the characters in the array. */
        cword=icrctb[(jrev < 0 ? rchr[bufptr[j]] :
            bufptr[j]) ^ HIBYTE(cword)] ^ LOBYTE(cword) << 8;
    return (jrev >= 0 ? cword : rchr[HIBYTE(cword)] | rchr[LOBYTE(cword)] << 8);
        /* Do we need to reverse the output? */
}


What if you need a 32-bit checksum? For a true 32-bit CRC, you will need 
to rewrite the routines given to work with a longer generating polynomial. For 
example, $x^{32} + x^7 + x^5 + x^3 + x^2 + x + 1$ is primitive modulo 2, and has nonleading, 
nonzero bits only in its least significant byte (which makes for some simplification). 
The idea of table lookup on only the most significant byte of the CRC register goes 
through unchanged. 

If you do not care about the M-consecutive bit property of the checksum, but 
rather only need a statistically random 32 bits, then you can use icrc as given 
here: Call it once with jrev = 1 to get 16 bits, and again with jrev = -1 to get 
another 16 bits. The internal bit reversals make these two 16-bit CRCs in effect 
totally independent of each other. 

Other Kinds of Checksums 

Quite different from CRCs are the various techniques used to append a decimal 
“check digit” to numbers that are handled by human beings (e.g., typed into a 
computer). Check digits need to be proof against the kinds of highly structured 
errors that humans tend to make, such as transposing consecutive digits. Wagner and 
Putter [7] give an interesting introduction to this subject, including specific algorithms. 

Checksums now in widespread use vary from fair to poor. The 10-digit ISBN 
(International Standard Book Number) that you find on most books, including this 
one, uses the check equation 



$$10d_1 + 9d_2 + 8d_3 + \cdots + 2d_9 + d_{10} = 0 \pmod{11} \tag{20.3.1}$$

where $d_{10}$ is the right-hand check digit. The character “X” is used to represent a 
check digit value of 10. Another popular scheme is the so-called “IBM check,” often 
used for account numbers (including, e.g., MasterCard). Here, the check equation is 

$$2\#d_1 + d_2 + 2\#d_3 + d_4 + \cdots = 0 \pmod{10} \tag{20.3.2}$$

where $2\#d$ means, “multiply d by two and add the resulting decimal digits.” United 
States banks code checks with a 9-digit processing number whose check equation is 

$$3a_1 + 7a_2 + a_3 + 3a_4 + 7a_5 + a_6 + 3a_7 + 7a_8 + a_9 = 0 \pmod{10} \tag{20.3.3}$$
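
Check equations (20.3.1) and (20.3.2) take only a few lines each to verify. The following is a sketch with hypothetical function names. For the IBM check it assumes the usual convention that the doubled positions are counted from the right, starting with the digit next to the check digit; for a number with an even count of digits this doubles $d_1, d_3, \ldots$ exactly as in (20.3.2). Characters other than digits, such as hyphens, are skipped.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if s holds a 10-digit ISBN satisfying equation (20.3.1), else 0.
   An 'X' is accepted only as the final check digit, with value 10. */
int isbn10_ok(const char *s)
{
    int weight = 10, sum = 0, d;

    assert(s != NULL);
    for (; *s; s++) {
        if (*s >= '0' && *s <= '9') d = *s - '0';
        else if ((*s == 'X' || *s == 'x') && weight == 1) d = 10;
        else continue;               /* skip hyphens, spaces, and the like */
        if (weight < 1) return 0;    /* more than ten digits */
        sum += weight * d;
        weight--;
    }
    return weight == 0 && sum % 11 == 0;
}

/* Returns 1 if the digits in s satisfy the IBM check, else 0. */
int ibm_check_ok(const char *s)
{
    size_t k;
    int sum = 0, ndigits = 0, doubled = 0, d;

    assert(s != NULL);
    for (k = strlen(s); k-- > 0; ) {    /* scan from the right-hand end */
        if (s[k] < '0' || s[k] > '9') continue;
        d = s[k] - '0';
        if (doubled) {
            d *= 2;
            if (d > 9) d -= 9;          /* same as adding the two decimal digits */
        }
        sum += d;
        doubled = !doubled;
        ndigits++;
    }
    return ndigits > 1 && sum % 10 == 0;
}
```

For example, isbn10_ok("0-521-43108-5") returns 1, while the transposed "0-521-43180-5" is rejected.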

The bar code put on many envelopes by the U.S. Postal Service is decoded by 
removing the single tall marker bars at each end, and breaking the remaining bars 
into 6 or 10 groups of five. In each group the five bars signify (from left to right) 
the values 7, 4, 2, 1, 0. Exactly two of them will be tall. Their sum is the represented 
digit, except that zero is represented as 7 + 4. The 5- or 9-digit Zip Code is followed 
by a check digit, with the check equation 

$$\sum d_i = 0 \pmod{10} \tag{20.3.4}$$

None of these schemes is close to optimal. An elegant scheme due to Verhoeff 
is described in [7]. The underlying idea is to use the ten-element dihedral group $D_5$, 
which corresponds to the symmetries of a pentagon, instead of the cyclic group of 
the integers modulo 10. The check equation is 

$$a_1 * f(a_2) * f^2(a_3) * \cdots * f^{n-1}(a_n) = 0 \tag{20.3.5}$$

where * is (noncommutative) multiplication in $D_5$, and $f^i$ denotes the ith iteration 
of a certain fixed permutation. Verhoeff’s method finds all single errors in a string, 
and all adjacent transpositions. It also finds about 95% of twin errors ($aa \to bb$), 
jump transpositions ($acb \to bca$), and jump twin errors ($aca \to bcb$). Here is an 
implementation: 

int decchk(char string[], int n, char *ch)
/* Decimal check digit computation or verification. Returns as ch a check digit for appending
to string[0..n-1], that is, for storing into string[n]. In this mode, ignore the returned
boolean (integer) value. If string[0..n-1] already ends with a check digit (string[n-1]),
returns the function value true (1) if the check digit is valid, otherwise false (0). In this mode,
ignore the returned value of ch. Note that string and ch contain ASCII characters corresponding
to the digits 0-9, not byte values in that range. Other ASCII characters are allowed in
string, and are ignored in calculating the check digit. */
{
    char c;
    int j,k=0,m=0;
    static int ip[10][8]={{0,1,5,8,9,4,2,7},{1,5,8,9,4,2,7,0},{2,7,0,1,5,8,9,4},
        {3,6,3,6,3,6,3,6},{4,2,7,0,1,5,8,9},{5,8,9,4,2,7,0,1},{6,3,6,3,6,3,6,3},
        {7,0,1,5,8,9,4,2},{8,9,4,2,7,0,1,5},{9,4,2,7,0,1,5,8}};
    static int ij[10][10]={{0,1,2,3,4,5,6,7,8,9},{1,2,3,4,0,6,7,8,9,5},
        {2,3,4,0,1,7,8,9,5,6},{3,4,0,1,2,8,9,5,6,7},{4,0,1,2,3,9,5,6,7,8},
        {5,9,8,7,6,0,4,3,2,1},{6,5,9,8,7,1,0,4,3,2},{7,6,5,9,8,2,1,0,4,3},
        {8,7,6,5,9,3,2,1,0,4},{9,8,7,6,5,4,3,2,1,0}};
        /* Group multiplication and permutation tables. */

    for (j=0;j<n;j++) {    /* Look at successive characters. */
        c=string[j];
        if (c >= 48 && c <= 57)    /* Ignore everything except digits. */
            k=ij[k][ip[(c+2) % 10][7 & m++]];
    }
    for (j=0;j<=9;j++)    /* Find which appended digit will check properly. */
        if (!ij[k][ip[j][m & 7]]) break;
    *ch=j+48;    /* Convert to ASCII. */
    return k==0;
}


20.4 Huffman Coding and Compression of Data 

A lossless data compression algorithm takes a string of symbols (typically 
ASCII characters or bytes) and translates it reversibly into another string, one that is 
on the average of shorter length. The words “on the average” are crucial; it is obvious 
that no reversible algorithm can make all strings shorter — there just aren’t enough 
short strings to be in one-to-one correspondence with longer strings. Compression 
algorithms are possible only when, on the input side, some strings, or some input 
symbols, are more common than others. These can then be encoded in fewer bits 
than rarer input strings or symbols, giving a net average gain. 

There exist many, quite different, compression techniques, corresponding to 
different ways of detecting and using departures from equiprobability in input strings. 
In this section and the next we shall consider only variable length codes with defined 
word inputs. In these, the input is sliced into fixed units, for example ASCII 
characters, while the corresponding output comes in chunks of variable size. The 
simplest such method is Huffman coding [1], discussed in this section. Another 
example, arithmetic compression, is discussed in §20.5. 

At the opposite extreme from defined-word, variable length codes are schemes 
that divide up the input into units of variable length (words or phrases of English text, 
for example) and then transmit these, often with a fixed-length output code. The most 
widely used code of this type is the Ziv-Lempel code [2]. References [3-6] give the 
flavor of some other compression techniques, with references to the large literature. 

The idea behind Huffman coding is simply to use shorter bit patterns for more 
common characters. We can make this idea quantitative by considering the concept 
of entropy. Suppose the input alphabet has $N_{ch}$ characters, and that these occur in
the input string with respective probabilities $p_i$, $i = 1, \ldots, N_{ch}$, so that $\sum p_i = 1$.
Then the fundamental theorem of information theory says that strings consisting of
independently random sequences of these characters (a conservative, but not always
realistic assumption) require, on the average, at least

$$H = -\sum p_i \log_2 p_i \tag{20.4.1}$$

bits per character. Here H is the entropy of the probability distribution. Moreover, 
coding schemes exist which approach the bound arbitrarily closely. For the case of 
equiprobable characters, with all $p_i = 1/N_{ch}$, one easily sees that $H = \log_2 N_{ch}$,
which is the case of no compression at all. Any other set of $p_i$’s gives a smaller
entropy, allowing some useful compression. 

Notice that the bound of (20.4.1) would be achieved if we could encode character
$i$ with a code of length $L_i = -\log_2 p_i$ bits: Equation (20.4.1) would then be the
average $\sum p_i L_i$. The trouble with such a scheme is that $-\log_2 p_i$ is not generally
an integer. How can we encode the letter “Q” in 5.32 bits? Huffman coding makes
a stab at this by, in effect, approximating all the probabilities $p_i$ by integer powers
of 1/2, so that all the $L_i$’s are integral. If all the $p_i$’s are in fact of this form, then
a Huffman code does achieve the entropy bound $H$.

The construction of a Huffman code is best illustrated by example. Imagine 
a language, Vowellish, with the $N_{ch} = 5$ character alphabet A, E, I, O, and U,
occurring with the respective probabilities 0.12, 0.42, 0.09, 0.30, and 0.07. Then the
construction of a Huffman code for Vowellish is accomplished in the following table: 


Node   Stage:    1       2       3       4       5

 1     A:       0.12    0.12 ■
 2     E:       0.42    0.42    0.42    0.42 ■
 3     I:       0.09 ■
 4     O:       0.30    0.30    0.30 ■
 5     U:       0.07 ■
 6     UI:              0.16 ■
 7     AUI:                     0.28 ■
 8     AUIO:                            0.58 ■
 9     EAUIO:                                   1.00


Here is how it works, proceeding in sequence through $N_{ch}$ stages, represented
by the columns of the table. The first stage starts with $N_{ch}$ nodes, one for each
letter of the alphabet, containing their respective relative frequencies. At each stage,
the two smallest probabilities are found, summed to make a new node, and then
dropped from the list of active nodes. (A “block” denotes the stage where a node is
dropped.) All active nodes (including the new composite) are then carried over to
the next stage (column). In the table, the names assigned to new nodes (e.g., AUI)
are inconsequential. In the example shown, it happens that (after stage 1) the two
smallest nodes are always an original node and a composite one; this need not be
true in general: The two smallest probabilities might be both original nodes, or both
composites, or one of each. At the last stage, all nodes will have been collected into
one grand composite of total probability 1.

Figure 20.4.1. Huffman code for the fictitious language Vowellish, in tree form. A letter (A, E, I,
O, or U) is encoded or decoded by traversing the tree from the top down; the code is the sequence of
0’s and 1’s on the branches. The value to the right of each node is its probability; to the left, its node
number in the accompanying table.
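
The stage-by-stage merging of the two smallest active nodes can be sketched in a few lines of C. This is a simplified illustration, not the heap-based routine given later in this section: it finds the two smallest nodes by linear search, uses zero-based arrays, and returns only the code lengths. The names huffman_lengths and MAXSYM are hypothetical.

```c
#include <assert.h>

#define MAXSYM 64                 /* hypothetical limit for this sketch */

/* Compute Huffman code lengths len[0..n-1] for probabilities p[0..n-1]
   by repeatedly merging the two smallest active nodes.
   Plain linear search is used; no heap. */
void huffman_lengths(const double p[], int n, int len[])
{
    double prob[2 * MAXSYM - 1];
    int parent[2 * MAXSYM - 1], active[2 * MAXSYM - 1];
    int i, k, nodes = n;

    assert(n >= 1 && n <= MAXSYM);
    for (i = 0; i < n; i++) { prob[i] = p[i]; parent[i] = -1; active[i] = 1; }

    for (k = 1; k < n; k++) {            /* n-1 merges in all */
        int a = -1, b = -1;              /* a: smallest, b: second smallest */
        for (i = 0; i < nodes; i++) {
            if (!active[i]) continue;
            if (a < 0 || prob[i] < prob[a]) { b = a; a = i; }
            else if (b < 0 || prob[i] < prob[b]) b = i;
        }
        prob[nodes] = prob[a] + prob[b]; /* new composite node */
        parent[nodes] = -1;
        active[nodes] = 1;
        parent[a] = parent[b] = nodes;   /* drop the two merged nodes */
        active[a] = active[b] = 0;
        nodes++;
    }
    for (i = 0; i < n; i++) {            /* code length = depth of the leaf */
        int d = 0, j = i;
        while (parent[j] >= 0) { j = parent[j]; d++; }
        len[i] = d;
    }
}
```

Applied to the Vowellish probabilities in the order A, E, I, O, U, the sketch gives code lengths 3, 1, 4, 2, 4, for an average of 0.12×3 + 0.42×1 + 0.09×4 + 0.30×2 + 0.07×4 = 2.02 bits per character.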

Now, to see the code, you redraw the data in the above table as a tree (Figure 
20.4.1). As shown, each node of the tree corresponds to a node (row) in the table, 
indicated by the integer to its left and probability value to its right. Terminal nodes, 
so called, are shown as circles; these are single alphabetic characters. The branches 
of the tree are labeled 0 and 1. The code for a character is the sequence of zeros and 
ones that lead to it, from the top down. For example, E is simply 0, while U is 1010. 

Any string of zeros and ones can now be decoded into an alphabetic sequence. 
Consider, for example, the string 1011111010. Starting at the top of the tree we 
descend through 1011 to I, the first character. Since we have reached a terminal 
node, we reset to the top of the tree, next descending through 11 to O. Finally 1010 
gives U. The string thus decodes to IOU. 
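
The decoding walk just described can be mimicked with a table of code words instead of an explicit tree. The sketch below is illustrative only: the code words for E, O, I, and U are the ones quoted in the text, while A = 100 is our inference (it is the only leaf left over), and the names vow_sym, vow_code, and vow_decode are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Code words read off the tree. E, O, I and U are quoted in the text;
   A = "100" is inferred as the only remaining leaf. */
static const char *const vow_sym = "AEIOU";
static const char *const vow_code[5] = {"100", "0", "1011", "11", "1010"};

/* Decode a string of '0'/'1' characters into out[] (NUL-terminated).
   Returns the number of characters decoded, or -1 if the input ends in
   the middle of a code word or out[] is too small. */
int vow_decode(const char *bits, char *out, size_t outsize)
{
    size_t pos = 0, nbits = strlen(bits);
    int nout = 0;

    assert(outsize > 0);
    while (pos < nbits) {
        int i, hit = -1;
        for (i = 0; i < 5; i++) {        /* prefix property: at most one match */
            size_t l = strlen(vow_code[i]);
            if (l <= nbits - pos && strncmp(bits + pos, vow_code[i], l) == 0) {
                hit = i;
                break;
            }
        }
        if (hit < 0 || (size_t)nout + 1 >= outsize) return -1;
        out[nout++] = vow_sym[hit];
        pos += strlen(vow_code[hit]);
    }
    out[nout] = '\0';
    return nout;
}
```

Because no code word is a prefix of another, the first match at each position is the only possible one, which is what makes the bit string uniquely decodable without separators.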

These ideas are embodied in the following routines. Input to the first routine
hufmak is an integer vector of the frequency of occurrence of the nchin = $N_{ch}$
alphabetic characters, i.e., a set of integers proportional to the $p_i$’s. hufmak, along
with hufapp, which it calls, performs the construction of the above table, and also the
tree of Figure 20.4.1. The routine utilizes a heap structure (see §8.3) for efficiency;
for a detailed description, see Sedgewick [7].

#include "nrutil.h"
typedef struct {
    unsigned long *icod,*ncod,*left,*right,nch,nodemax;
} huffcode;

void hufmak(unsigned long nfreq[], unsigned long nchin, unsigned long *ilong,
    unsigned long *nlong, huffcode *hcode)
/* Given the frequency of occurrence table nfreq[1..nchin] of nchin characters, construct the
   Huffman code in the structure hcode. Returned values ilong and nlong are the character
   number that produced the longest code symbol, and the length of that symbol. You should
   check that nlong is not larger than your machine's word length. */
{
    void hufapp(unsigned long index[], unsigned long nprob[], unsigned long n,
        unsigned long i);
    int ibit;
    long node,*up;
    unsigned long j,k,*index,n,nused,*nprob;
    static unsigned long setbit[32]={0x1L,0x2L,0x4L,0x8L,0x10L,0x20L,
        0x40L,0x80L,0x100L,0x200L,0x400L,0x800L,0x1000L,0x2000L,
        0x4000L,0x8000L,0x10000L,0x20000L,0x40000L,0x80000L,0x100000L,
        0x200000L,0x400000L,0x800000L,0x1000000L,0x2000000L,0x4000000L,
        0x8000000L,0x10000000L,0x20000000L,0x40000000L,0x80000000L};

    hcode->nch=nchin;                              /* Initialization. */
    index=lvector(1,(long)(2*hcode->nch-1));
    up=(long *)lvector(1,(long)(2*hcode->nch-1));  /* Vector that will keep track of heap. */
    nprob=lvector(1,(long)(2*hcode->nch-1));
    for (nused=0,j=1;j<=hcode->nch;j++) {
        nprob[j]=nfreq[j];
        hcode->icod[j]=hcode->ncod[j]=0;
        if (nfreq[j]) index[++nused]=j;
    }
    for (j=nused;j>=1;j--) hufapp(index,nprob,nused,j);
    /* Sort nprob into a heap structure in index. */
    k=hcode->nch;
    while (nused > 1) {                  /* Combine heap nodes, remaking the heap at each stage. */
        node=index[1];
        index[1]=index[nused--];
        hufapp(index,nprob,nused,1);
        nprob[++k]=nprob[index[1]]+nprob[node];
        hcode->left[k]=node;             /* Store left and right children of a node. */
        hcode->right[k]=index[1];
        up[index[1]] = -(long)k;         /* Indicate whether a node is a left or right */
        up[node]=index[1]=k;             /* child of its parent. */
        hufapp(index,nprob,nused,1);
    }
    up[hcode->nodemax=k]=0;
    for (j=1;j<=hcode->nch;j++) {        /* Make the Huffman code from the tree. */
        if (nprob[j]) {
            for (n=0,ibit=0,node=up[j];node;node=up[node],ibit++) {
                if (node < 0) {
                    n |= setbit[ibit];
                    node = -node;
                }
            }
            hcode->icod[j]=n;
            hcode->ncod[j]=ibit;
        }
    }
    *nlong=0;
    for (j=1;j<=hcode->nch;j++) {
        if (hcode->ncod[j] > *nlong) {
            *nlong=hcode->ncod[j];
            *ilong=j-1;
        }
    }
    free_lvector(nprob,1,(long)(2*hcode->nch-1));
    free_lvector((unsigned long *)up,1,(long)(2*hcode->nch-1));
    free_lvector(index,1,(long)(2*hcode->nch-1));
}


void hufapp(unsigned long index[], unsigned long nprob[], unsigned long n,
    unsigned long i)
/* Used by hufmak to maintain a heap structure in the array index[1..n]. */
{
    unsigned long j,k;

    k=index[i];
    while (i <= (n>>1)) {
        if ((j = i << 1) < n && nprob[index[j]] > nprob[index[j+1]]) j++;
        if (nprob[k] <= nprob[index[j]]) break;
        index[i]=index[j];
        i=j;
    }
    index[i]=k;
}


Note that the structure hcode must be defined and allocated in your main 
program with statements like this: 

#include "nrutil.h"
#define MC 512          /* Maximum anticipated value of nchin in hufmak. */
#define MQ (2*MC-1)
typedef struct {
    unsigned long *icod,*ncod,*left,*right,nch,nodemax;
} huffcode;

huffcode hcode;

hcode.icod=(unsigned long *)lvector(1,MQ);     /* Allocate space within hcode. */
hcode.ncod=(unsigned long *)lvector(1,MQ);
hcode.left=(unsigned long *)lvector(1,MQ);
hcode.right=(unsigned long *)lvector(1,MQ);
for (j=1;j<=MQ;j++) hcode.icod[j]=hcode.ncod[j]=0;

Once the code is constructed, one encodes a string of characters by repeated calls 
to hufenc, which simply does a table lookup of the code and appends it to the 
output message. 

#include <stdio.h>
#include <stdlib.h>

typedef struct {
    unsigned long *icod,*ncod,*left,*right,nch,nodemax;
} huffcode;

void hufenc(unsigned long ich, unsigned char **codep, unsigned long *lcode,
    unsigned long *nb, huffcode *hcode)
/* Huffman encode the single character ich (in the range 0..nch-1) using the code in the
   structure hcode, write the result to the character array (*codep)[1..lcode] starting at bit
   nb (whose smallest valid value is zero), and increment nb appropriately. This routine is called
   repeatedly to encode consecutive characters in a message, but must be preceded by a single
   initializing call to hufmak, which constructs hcode. */
{
    void nrerror(char error_text[]);
    int l,n;
    unsigned long k,nc;
    static unsigned long setbit[32]={0x1L,0x2L,0x4L,0x8L,0x10L,0x20L,
        0x40L,0x80L,0x100L,0x200L,0x400L,0x800L,0x1000L,0x2000L,
        0x4000L,0x8000L,0x10000L,0x20000L,0x40000L,0x80000L,0x100000L,
        0x200000L,0x400000L,0x800000L,0x1000000L,0x2000000L,0x4000000L,
        0x8000000L,0x10000000L,0x20000000L,0x40000000L,0x80000000L};

    k=ich+1;                 /* Convert character range 0..nch-1 to array index range 1..nch. */
    if (k > hcode->nch || k < 1) nrerror("ich out of range in hufenc.");
    for (n=hcode->ncod[k]-1;n>=0;n--,++(*nb)) {   /* Loop over the bits in the stored */
        nc=(*nb >> 3);                            /* Huffman code for ich. */
        if (++nc >= *lcode) {
            fprintf(stderr,"Reached the end of the 'code' array.\n");
            fprintf(stderr,"Attempting to expand its size.\n");
            *lcode *= 1.5;
            if ((*codep=(unsigned char *)realloc(*codep,
                (unsigned)(*lcode*sizeof(unsigned char)))) == NULL) {
                nrerror("Size expansion failed.");
            }
        }
        l=(*nb) & 7;
        if (!l) (*codep)[nc]=0;                   /* Set appropriate bits in code. */
        if (hcode->icod[k] & setbit[n]) (*codep)[nc] |= setbit[l];
    }
}


Decoding a Huffman-encoded message is slightly more complicated. The 
coding tree must be traversed from the top down, using up a variable number of bits: 


typedef struct {
    unsigned long *icod,*ncod,*left,*right,nch,nodemax;
} huffcode;

void hufdec(unsigned long *ich, unsigned char *code, unsigned long lcode,
    unsigned long *nb, huffcode *hcode)
/* Starting at bit number nb in the character array code[1..lcode], use the Huffman code stored
   in the structure hcode to decode a single character (returned as ich in the range 0..nch-1)
   and increment nb appropriately. Repeated calls, starting with nb = 0, will return successive
   characters in a compressed message. The returned value ich=nch indicates end-of-message.
   The structure hcode must already have been defined and allocated in your main program, and
   also filled by a call to hufmak. */
{
    long nc,node;
    static unsigned char setbit[8]={0x1,0x2,0x4,0x8,0x10,0x20,0x40,0x80};

    node=hcode->nodemax;     /* Set node to the top of the decoding tree, and loop */
    for (;;) {               /* until a valid character is obtained. */
        nc=(*nb >> 3);
        if (++nc > lcode) {  /* Ran out of input; with ich=nch indicating end of message. */
            *ich=hcode->nch;
            return;
        }
        node=(code[nc] & setbit[7 & (*nb)++] ?
            hcode->right[node] : hcode->left[node]);
        /* Branch left or right in tree, depending on its value. */
        if (node <= hcode->nch) {    /* If we reach a terminal node, we have a complete */
            *ich=node-1;             /* character and can return. */
            return;
        }
    }
}


For simplicity, hufdec quits when it runs out of code bytes; if your coded 
message is not an integral number of bytes, and if $N_{ch}$ is less than 256, hufdec can
return a spurious final character or two, decoded from the spurious trailing bits in 
your last code byte. If you have independent knowledge of the number of characters 
sent, you can readily discard these. Otherwise, you can fix this behavior by providing 
a bit, not byte, count, and modifying the routine accordingly. (When $N_{ch}$ is 256 or
larger, hufdec will normally run out of code in the middle of a spurious character, 
and it will be discarded.) 

Run-Length Encoding 

For the compression of highly correlated bit-streams (for example the black or 
white values along a facsimile scan line), Huffman compression is often combined 
with run-length encoding: Instead of sending each bit, the input stream is converted to 
a series of integers indicating how many consecutive bits have the same value. These 
integers are then Huffman-compressed. The Group 3 CCITT facsimile standard 
functions in this manner, with a fixed, immutable, Huffman code, optimized for a 
set of eight standard documents [8,9]. 
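
The first half of this scheme, turning a bit stream into run lengths, can be sketched as follows. The function name bits_to_runs and the convention that the first run counts 0 bits (so that a stream beginning with a 1 produces an initial run of length zero) are assumptions of this sketch, chosen so that the decoder always knows which value the first run refers to.

```c
#include <assert.h>
#include <stddef.h>

/* Convert bits[0..n-1] (each 0 or 1) into run lengths. By the convention
   assumed here, the first run counts 0 bits, so a stream that starts with
   a 1 yields an initial run of length 0. Returns the number of runs written
   to runs[], or 0 if more than maxruns would be needed. */
size_t bits_to_runs(const unsigned char bits[], size_t n,
                    unsigned long runs[], size_t maxruns)
{
    size_t i, nruns = 0;
    unsigned char current = 0;      /* value of the run being counted */
    unsigned long length = 0;

    assert(runs != NULL);
    for (i = 0; i < n; i++) {
        if (bits[i] == current) {
            length++;
        } else {                    /* value changed: close the current run */
            if (nruns == maxruns) return 0;
            runs[nruns++] = length;
            current = bits[i];
            length = 1;
        }
    }
    if (nruns == maxruns) return 0;
    runs[nruns++] = length;         /* close the final run */
    return nruns;
}
```

The resulting integers would then be the "characters" handed to a Huffman coder; long runs of a single value collapse to a single symbol.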

20.5 Arithmetic Coding

We saw in the previous section that a perfect (entropy-bounded) coding scheme 
would use $L_i = -\log_2 p_i$ bits to encode character $i$ (in the range $1 \le i \le N_{ch}$),
if $p_i$ is its probability of occurrence. Huffman coding gives a way of rounding the
$L_i$’s to close integer values and constructing a code with those lengths. Arithmetic
coding [1], which we now discuss, actually does manage to encode characters using
noninteger numbers of bits! It also provides a convenient way to output the result 
not as a stream of bits, but as a stream of symbols in any desired radix. This latter 
property is particularly useful if you want, e.g., to convert data from bytes (radix 
256) to printable ASCII characters (radix 94), or to case-independent alphanumeric 
sequences containing only A-Z and 0-9 (radix 36). 

In arithmetic coding, an input message of any length is represented as a real 
number R in the range $0 \le R < 1$. The longer the message, the more precision
required of R. This is best illustrated by an example, so let us return to the fictitious
language, Vowellish, of the previous section. Recall that Vowellish has a 5 character
alphabet (A, E, I, O, U), with occurrence probabilities 0.12, 0.42, 0.09, 0.30, and
0.07, respectively. Figure 20.5.1 shows how a message beginning “IOU” is encoded:
The interval [0,1) is divided into segments corresponding to the 5 alphabetical
characters; the length of a segment is the corresponding $p_i$. We see that the first
message character, “I”, narrows the range of R to $0.37 \le R < 0.46$. This interval is
now subdivided into five subintervals, again with lengths proportional to the $p_i$’s. The
second message character, “O”, narrows the range of R to $0.3763 \le R < 0.4033$.
The “U” character further narrows the range to $0.37630 \le R < 0.37819$. Any value
of R in this range can be sent as encoding “IOU”. In particular, the binary fraction 
.011000001 is in this range, so “IOU” can be sent in 9 bits. (Huffman coding took 
10 bits for this example, see §20.4.) 
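
The interval narrowing in this example is easy to reproduce in ordinary floating point. The sketch below is only an illustration of the idea, not the finite-precision method used by the routines in this section. The ordering of the segments (U, O, I, E, A from 0 upward) is an assumption: it is the ordering that reproduces the bounds quoted above. The names ac_sym, ac_prob, and ac_narrow are hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Segment order U, O, I, E, A from 0 upward is an assumption: it is the
   order that reproduces the bounds quoted in the text. */
static const char   ac_sym[]  = "UOIEA";
static const double ac_prob[] = {0.07, 0.30, 0.09, 0.42, 0.12};

/* Narrow [*lo, *hi) once per character of msg, starting from [0, 1).
   Returns 0 on success, -1 if msg contains a character outside the alphabet. */
int ac_narrow(const char *msg, double *lo, double *hi)
{
    assert(msg != NULL && lo != NULL && hi != NULL);
    *lo = 0.0;
    *hi = 1.0;
    for (; *msg; msg++) {
        const char *s = strchr(ac_sym, *msg);
        double width = *hi - *lo, cum = 0.0;
        int i, k;

        if (s == NULL) return -1;
        k = (int)(s - ac_sym);
        for (i = 0; i < k; i++) cum += ac_prob[i];   /* start of segment k */
        *hi = *lo + width * (cum + ac_prob[k]);
        *lo = *lo + width * cum;
    }
    return 0;
}
```

For the message "IOU" the interval shrinks to approximately [0.37630, 0.37819), which contains the binary fraction .011000001 = 0.376953125.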

Of course there is the problem of knowing when to stop decoding. The fraction 
.011000001 represents not simply “IOU,” but “IOU...,” where the ellipses represent 
an infinite string of successor characters. To resolve this ambiguity, arithmetic 
coding generally assumes the existence of a special $(N_{ch}+1)$th character, EOM
(end of message), which occurs only once at the end of the input. Since EOM 
has a low probability of occurrence, it gets allocated only a very tiny piece of 
the number line. 

In the above example, we gave R as a binary fraction. We could just as well 
have output it in any other radix, e.g., base 94 or base 36, whatever is convenient 
for the anticipated storage or communication channel. 

You might wonder how one deals with the seemingly incredible precision 
required of R for a long message. The answer is that R is never actually represented 
all at once. At any given stage we have upper and lower bounds for R represented
as a finite number of digits in the output radix. As digits of the upper and lower 
bounds become identical, we can left-shift them away and bring in new digits at the 
low-significance end. The routines below have a parameter NWK for the number of 
working digits to keep around. This must be large enough to make the chance of 
an accidental degeneracy vanishingly small. (The routines signal if a degeneracy 
ever occurs.) Since the process of discarding old digits and bringing in new ones is 
performed identically on encoding and decoding, everything stays synchronized.

The routine arcmak constructs the cumulative frequency distribution table used
to partition the interval at each stage. In the principal routine arcode, when an
interval of size jdif is to be partitioned in the proportions of some n to some ntot,
say, then we must compute (n*jdif)/ntot. With integer arithmetic, the numerator
is likely to overflow; and, unfortunately, an expression like jdif/(ntot/n) is not 
equivalent. In the implementation below, we resort to double precision floating 
arithmetic for this calculation. Not only is this inefficient, but different roundoff 
errors can (albeit very rarely) make different machines encode differently, though any 
one type of machine will decode exactly what it encoded, since identical roundoff 
errors occur in the two processes. For serious use, one needs to replace this floating 
calculation with an integer computation in a double register (not available to the 
C programmer). 
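
On compilers that provide a 64-bit integer type, the exact computation can be written portably when the operands fit in 32 bits. The sketch below is an illustration under that assumption (a C99 compiler with uint64_t in stdint.h); the function name muldiv32 is hypothetical, and the routines in this section do not use it.

```c
#include <assert.h>
#include <stdint.h>

/* Exact floor(k*j/m) for 32-bit operands, using a 64-bit intermediate
   product instead of double precision floating arithmetic. Assumes a
   C99 compiler that provides uint64_t. */
uint32_t muldiv32(uint32_t j, uint32_t k, uint32_t m)
{
    uint64_t prod;

    assert(m != 0);
    prod = (uint64_t)j * (uint64_t)k;    /* fits in 64 bits, cannot overflow */
    return (uint32_t)(prod / m);         /* caller ensures the quotient fits */
}
```

In the partitioning step the multiplier never exceeds the divisor, so the quotient is no larger than jdif and fits back into 32 bits. Because the result is exact, it is the same on every machine.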

The internally set variable minint, which is the minimum allowed number
of discrete steps between the upper and lower bounds, determines when new
low-significance digits are added. minint must be large enough to provide resolution of
all the input characters. That is, we must have $p_i \times \mathtt{minint} > 1$ for all $i$. A value
of $100 N_{ch}$, or $1.1/\min p_i$, whichever is larger, is generally adequate. However, for
safety, the routine below takes minint to be as large as possible, with the product
minint*nradd just smaller than overflow. This results in some time inefficiency,
and in a few unnecessary characters being output at the end of a message. You can
decrease minint if you want to live closer to the edge.

A final safety feature in arcmak is its refusal to believe zero values in the table 
nfreq; a 0 is treated as if it were a 1. If this were not done, the occurrence in a 
message of a single character whose nfreq entry is zero would result in scrambling 
the entire rest of the message. If you want to live dangerously, with a very slightly 
more efficient coding, you can delete the IMAX( , 1) operation. 

#include "nrutil.h"
#include <limits.h>         /* ANSI header file containing integer ranges. */
#define MC 512
#ifdef ULONG_MAX            /* Maximum value of unsigned long. */
#define MAXINT (ULONG_MAX >> 1)
#else
#define MAXINT 2147483647
#endif
/* Here MC is the largest anticipated value of nchh; MAXINT is a large positive integer that does
   not overflow. */

typedef struct {
    unsigned long *ilob,*iupb,*ncumfq,jdif,nc,minint,nch,ncum,nrad;
} arithcode;

void arcmak(unsigned long nfreq[], unsigned long nchh, unsigned long nradd,
    arithcode *acode)
/* Given a table nfreq[1..nchh] of the frequency of occurrence of nchh symbols, and given
   a desired output radix nradd, initialize the cumulative frequency table and other variables for
   arithmetic compression in the structure acode. */
{
    unsigned long j;

    if (nchh > MC) nrerror("input radix may not exceed MC in arcmak.");
    if (nradd > 256) nrerror("output radix may not exceed 256 in arcmak.");

    acode->minint=MAXINT/nradd;
    acode->nch=nchh;
    acode->nrad=nradd;
    acode->ncumfq[1]=0;
    for (j=2;j<=acode->nch+1;j++)
        acode->ncumfq[j]=acode->ncumfq[j-1]+IMAX(nfreq[j-1],1);
    acode->ncum=acode->ncumfq[acode->nch+2]=acode->ncumfq[acode->nch+1]+1;
}


The structure acode must be defined and allocated in your main program with 
statements like this: 

#include "nrutil.h"
#define MC 512          /* Maximum anticipated value of nchh in arcmak. */
#define NWK 20          /* Keep this value the same as in arcode, below. */
typedef struct {
    unsigned long *ilob,*iupb,*ncumfq,jdif,nc,minint,nch,ncum,nrad;
} arithcode;

arithcode acode;

acode.ilob=(unsigned long *)lvector(1,NWK);      /* Allocate space within acode. */
acode.iupb=(unsigned long *)lvector(1,NWK);
acode.ncumfq=(unsigned long *)lvector(1,MC+2);

Individual characters in a message are coded or decoded by the routine arcode,
which in turn uses the utility arcsum.


#include <stdio.h>
#include <stdlib.h>
#define NWK 20
#define JTRY(j,k,m) ((long)((((double)(k))*((double)(j)))/((double)(m))))
/* This macro is used to calculate (k*j)/m without overflow. Program efficiency can be improved
   by substituting an assembly language routine that does integer multiply to a double register. */

typedef struct {
    unsigned long *ilob,*iupb,*ncumfq,jdif,nc,minint,nch,ncum,nrad;
} arithcode;

void arcode(unsigned long *ich, unsigned char **codep, unsigned long *lcode,
    unsigned long *lcd, int isign, arithcode *acode)
/* Compress (isign = 1) or decompress (isign = -1) the single character ich into or out of
   the character array (*codep)[1..lcode], starting with byte (*codep)[lcd] and (if necessary)
   incrementing lcd so that, on return, lcd points to the first unused byte in *codep. Note
   that the structure acode contains both information on the code, and also state information on
   the particular output being written into the array *codep. An initializing call with isign=0
   is required before beginning any *codep array, whether for encoding or decoding. This is in
   addition to the initializing call to arcmak that is required to initialize the code itself. A call
   with ich=nch (as set in arcmak) has the reserved meaning "end of message." */
{
    void arcsum(unsigned long iin[], unsigned long iout[], unsigned long ja,
        int nwk, unsigned long nrad, unsigned long nc);
    void nrerror(char error_text[]);
    int j,k;
    unsigned long ihi,ja,jh,jl,m;

    if (!isign) {            /* Initialize enough digits of the upper and lower bounds. */
        acode->jdif=acode->nrad-1;
        for (j=NWK;j>=1;j--) {
            acode->iupb[j]=acode->nrad-1;
            acode->ilob[j]=0;
            acode->nc=j;
            if (acode->jdif > acode->minint) return;     /* Initialization complete. */
            acode->jdif=(acode->jdif+1)*acode->nrad-1;
        }
        nrerror("NWK too small in arcode.");
    } else {
        if (isign > 0) {     /* If encoding, check for valid input character. */
            if (*ich > acode->nch) nrerror("bad ich in arcode.");
        }
        else {               /* If decoding, locate the character ich by bisection. */
            ja=(*codep)[*lcd]-acode->ilob[acode->nc];
            for (j=acode->nc+1;j<=NWK;j++) {
                ja *= acode->nrad;
                ja += ((*codep)[*lcd+j-acode->nc]-acode->ilob[j]);
            }
            ihi=acode->nch+1;
            *ich=0;
            while (ihi-(*ich) > 1) {
                m=(*ich+ihi)>>1;
                if (ja >= JTRY(acode->jdif,acode->ncumfq[m+1],acode->ncum))
                    *ich=m;
                else ihi=m;
            }
            if (*ich == acode->nch) return;              /* Detected end of message. */
        }
        /* Following code is common for encoding and decoding. Convert character ich to a new
           subrange [ilob,iupb). */







        jh=JTRY(acode->jdif,acode->ncumfq[*ich+2],acode->ncum);
        jl=JTRY(acode->jdif,acode->ncumfq[*ich+1],acode->ncum);
        acode->jdif=jh-jl;
        arcsum(acode->ilob,acode->iupb,jh,NWK,acode->nrad,acode->nc);
        arcsum(acode->ilob,acode->ilob,jl,NWK,acode->nrad,acode->nc);
        /* How many leading digits to output (if encoding) or skip over? */
        for (j=acode->nc;j<=NWK;j++) {
            if (*ich != acode->nch && acode->iupb[j] != acode->ilob[j]) break;
            if (*lcd > *lcode) {
                fprintf(stderr,"Reached the end of the 'code' array.\n");
                fprintf(stderr,"Attempting to expand its size.\n");
                *lcode += *lcode/2;
                if ((*codep=(unsigned char *)realloc(*codep,
                    (unsigned)(*lcode*sizeof(unsigned char)))) == NULL) {
                    nrerror("Size expansion failed");
                }
            }
            if (isign > 0) (*codep)[*lcd]=(unsigned char)acode->ilob[j];
            ++(*lcd);
        }
        if (j > NWK) return;     /* Ran out of message. Did someone forget to encode a */
        acode->nc=j;             /* terminating ncd? */
        for (j=0;acode->jdif<acode->minint;j++)          /* How many digits to shift? */
            acode->jdif *= acode->nrad;
        if (acode->nc-j < 1) nrerror("NWK too small in arcode.");
        if (j) {                 /* Shift them. */
            for (k=acode->nc;k<=NWK;k++) {
                acode->iupb[k-j]=acode->iupb[k];
                acode->ilob[k-j]=acode->ilob[k];
            }
        }
        acode->nc -= j;
        for (k=NWK-j+1;k<=NWK;k++) acode->iupb[k]=acode->ilob[k]=0;
    }
    return;                      /* Normal return. */
}



void arcsum(unsigned long iin[], unsigned long iout[], unsigned long ja,
    int nwk, unsigned long nrad, unsigned long nc)
/* Used by arcode. Add the integer ja to the radix nrad multiple-precision integer iin[nc..nwk].
   Return the result in iout[nc..nwk]. */
{
    int j,karry=0;
    unsigned long jtmp;

    for (j=nwk;j>nc;j--) {
        jtmp=ja;
        ja /= nrad;
        iout[j]=iin[j]+(jtmp-ja*nrad)+karry;
        if (iout[j] >= nrad) {
            iout[j] -= nrad;
            karry=1;
        } else karry=0;
    }
    iout[nc]=iin[nc]+ja+karry;
}



If radix-changing, rather than compression, is your primary aim (for example 
to convert an arbitrary file into printable characters) then you are of course free to 
set all the components of nfreq equal, say, to 1.

20.6 Arithmetic at Arbitrary Precision

Let’s compute the number π to a couple of thousand decimal places. In doing
so, we’ll learn some things about multiple precision arithmetic on computers and 
meet quite an unusual application of the fast Fourier transform (FFT). We’ll also 
develop a set of routines that you can use for other calculations at any desired level 
of arithmetic precision. 

To start with, we need an analytic algorithm for π. Useful algorithms
are quadratically convergent, i.e., they double the number of significant digits at
each iteration. Quadratically convergent algorithms for π are based on the AGM
(arithmetic geometric mean) method, which also finds application to the calculation
of elliptic integrals (cf. §6.11) and in advanced implementations of the ADI method
for elliptic partial differential equations (§19.5). Borwein and Borwein [1] treat this
subject, which is beyond our scope here. One of their algorithms for π starts with
the initializations

$$X_0 = \sqrt{2}, \qquad \pi_0 = 2 + \sqrt{2}, \qquad Y_0 = \sqrt[4]{2} \tag{20.6.1}$$

and then, for $i = 0, 1, \ldots$, repeats the iteration

$$X_{i+1} = \frac{1}{2}\left(\sqrt{X_i} + \frac{1}{\sqrt{X_i}}\right), \qquad
\pi_{i+1} = \pi_i \left(\frac{X_{i+1} + 1}{Y_i + 1}\right), \qquad
Y_{i+1} = \frac{Y_i \sqrt{X_{i+1}} + \dfrac{1}{\sqrt{X_{i+1}}}}{Y_i + 1} \tag{20.6.2}$$

The value π emerges as the limit $\pi_\infty$.

Now, to the question of how to do arithmetic to arbitrary precision: In a 
high-level language like C, a natural choice is to work in radix (base) 256, so that 
character arrays can be directly interpreted as strings of digits. At the very end of 
our calculation, we will want to convert our answer to radix 10, but that is essentially 
a frill for the benefit of human ears, accustomed to the familiar chant, “three point
one four one five nine....” For any less frivolous calculation, we would likely never
leave base 256 (or the thence trivially reachable hexadecimal, octal, or binary bases).

We will adopt the convention of storing digit strings in the “human” ordering, 
that is, with the first stored digit in an array being most significant, the last stored digit 
being least significant. The opposite convention would, of course, also be possible. 
“Carries,” where we need to partition a number larger than 255 into a low-order 
byte and a high-order carry, present a minor programming annoyance, solved, in the 
routines below, by the use of the macros LOBYTE and HIBYTE. 

It is easy at this point, following Knuth [2], to write a routine for the “fast” 
arithmetic operations: short addition (adding a single byte to a string), addition, 
subtraction, short multiplication (multiplying a string by a single byte), short 
division, ones-complement negation; and a couple of utility operations, copying 
and left-shifting strings. (On the diskette, these functions are all in the single 
file mpops.c.) 


#define LOBYTE(x) ((unsigned char) ((x) & 0xff))
#define HIBYTE(x) ((unsigned char) ((x) >> 8 & 0xff))
/* Multiple precision arithmetic operations done on character strings, interpreted as radix 256
   numbers. This set of routines collects the simpler operations. */

void mpadd(unsigned char w[], unsigned char u[], unsigned char v[], int n)
/* Adds the unsigned radix 256 integers u[1..n] and v[1..n] yielding the unsigned integer
   w[1..n+1]. */
{
    int j;
    unsigned short ireg=0;

    for (j=n;j>=1;j--) {
        ireg=u[j]+v[j]+HIBYTE(ireg);
        w[j+1]=LOBYTE(ireg);
    }
    w[1]=HIBYTE(ireg);
}

void mpsub(int *is, unsigned char w[], unsigned char u[], unsigned char v[],
    int n)
/* Subtracts the unsigned radix 256 integer v[1..n] from u[1..n] yielding the unsigned integer
   w[1..n]. If the result is negative (wraps around), is is returned as -1; otherwise it is returned
   as 0. */
{
    int j;
    unsigned short ireg=256;

    for (j=n;j>=1;j--) {
        ireg=255+u[j]-v[j]+HIBYTE(ireg);
        w[j]=LOBYTE(ireg);
    }
    *is=HIBYTE(ireg)-1;
}

void mpsad(unsigned char w[], unsigned char u[], int n, int iv)
/* Short addition: the integer iv (in the range 0 <= iv <= 255) is added to the unsigned radix 256
   integer u[1..n], yielding w[1..n+1]. */
{
    int j;
    unsigned short ireg;

    ireg=256*iv;
    for (j=n;j>=1;j--) {
        ireg=u[j]+HIBYTE(ireg);
        w[j+1]=LOBYTE(ireg);
    }
    w[1]=HIBYTE(ireg);
}

void mpsmu(unsigned char w[], unsigned char u[], int n, int iv)
/* Short multiplication: the unsigned radix 256 integer u[1..n] is multiplied by the integer iv
   (in the range 0 <= iv <= 255), yielding w[1..n+1]. */
{
    int j;
    unsigned short ireg=0;

    for (j=n;j>=1;j--) {
        ireg=u[j]*iv+HIBYTE(ireg);
        w[j+1]=LOBYTE(ireg);
    }
    w[1]=HIBYTE(ireg);
}

void mpsdv(unsigned char w[], unsigned char u[], int n, int iv, int *ir)
/* Short division: the unsigned radix 256 integer u[1..n] is divided by the integer iv (in the
   range 0 < iv <= 255), yielding a quotient w[1..n] and a remainder ir (with 0 <= ir <= 255). */
{
    int i,j;

    *ir=0;
    for (j=1;j<=n;j++) {
        i=256*(*ir)+u[j];
        w[j]=(unsigned char) (i/iv);
        *ir=i % iv;
    }
}

void mpneg(unsigned char u[], int n)
/* Ones-complement negate the unsigned radix 256 integer u[1..n]. */
{
    int j;
    unsigned short ireg=256;

    for (j=n;j>=1;j--) {
        ireg=255-u[j]+HIBYTE(ireg);
        u[j]=LOBYTE(ireg);
    }
}

void mpmov(unsigned char u[], unsigned char v[], int n)
/* Move v[1..n] onto u[1..n]. */
{
    int j;

    for (j=1;j<=n;j++) u[j]=v[j];
}

void mplsh(unsigned char u[], int n)
/* Left shift u[2..n+1] onto u[1..n]. */
{
    int j;

    for (j=1;j<=n;j++) u[j]=u[j+1];
}

Full multiplication of two digit strings, if done by the traditional hand method,
is not a fast operation: In multiplying two strings of length $N$, the multiplicand
would be short-multiplied in turn by each byte of the multiplier, requiring $O(N^2)$
operations in all. We will see, however, that all the arithmetic operations on numbers
of length $N$ can in fact be done in $O(N \times \log N \times \log\log N)$ operations.

The trick is to recognize that multiplication is essentially a convolution (§13.1) 
of the digits of the multiplicand and multiplier, followed by some kind of carry 
operation. Consider, for example, two ways of writing the calculation 456 x 789: 


      456                         4    5    6
    × 789                   ×     7    8    9
    -----                   -----------------
     4104                        36   45   54
    3648                    32   40   48
   3192                28   35   42
   ------              ----------------------
   359784              28   67  118   93   54
                       3    5    9    7    8    4


The tableau on the left shows the conventional method of multiplication, in which
three separate short multiplications of the full multiplicand (by 9, 8, and 7) are
added to obtain the final result. The tableau on the right shows a different method
(sometimes taught for mental arithmetic), where the single-digit cross products are
all computed (e.g., 8 × 6 = 48), then added in columns to obtain an incompletely
carried result (here, the list 28, 67, 118, 93, 54). The final step is a single pass from
right to left, recording the single least-significant digit and carrying the higher digit
or digits into the total to the left (e.g., 93 + 5 = 98, record the 8, carry 9).
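The right-hand method is easy to try directly. The following is only a sketch, not one of this section's routines: the function name, the 0-based arrays, and the caller-supplied scratch array col are all our own choices for illustration. It forms every cross product, adds them into columns, and then makes the single right-to-left carry pass.

```c
#include <assert.h>
#include <stddef.h>

/* Multiply two digit strings (most significant digit first) in the given
   radix by forming the column sums and then making one carry pass.
   u has n digits, v has m digits, w receives n+m digits, and col is
   scratch space of length n+m-1.  Illustrative names, 0-based arrays. */
void conv_multiply(unsigned char w[], const unsigned char u[], size_t n,
                   const unsigned char v[], size_t m,
                   unsigned long col[], unsigned radix)
{
    size_t i, j, k;
    unsigned long carry = 0;

    assert(n > 0 && m > 0 && radix >= 2 && radix <= 256);

    for (k = 0; k < n + m - 1; k++) col[k] = 0;
    for (i = 0; i < n; i++)              /* all single-digit cross products, */
        for (j = 0; j < m; j++)          /* added into their columns */
            col[i + j] += (unsigned long)u[i] * v[j];

    for (k = n + m - 1; k-- > 0; ) {     /* carry pass, right to left */
        unsigned long t = col[k] + carry;
        w[k + 1] = (unsigned char)(t % radix);  /* record the low digit */
        carry = t / radix;                      /* carry the rest leftward */
    }
    w[0] = (unsigned char)carry;         /* leading digit of the product */
}
```

Called with the digits 4,5,6 and 7,8,9 in radix 10, col holds the column sums of the right-hand tableau and w holds the six digits of the product. The double loop is of course still an $N^2$ process; only the way the column sums are obtained changes in what follows.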

You can see immediately that the column sums in the right-hand method are
components of the convolution of the digit strings, for example
118 = 4 × 9 + 5 × 8 + 6 × 7. In §13.1 we learned how to compute the convolution of two vectors by
the fast Fourier transform (FFT): Each vector is FFT’d, the two complex transforms
are multiplied, and the result is inverse-FFT’d. Since the transforms are done with
floating arithmetic, we need sufficient precision so that the exact integer value of
each component of the result is discernible in the presence of roundoff error. We
should therefore allow a (conservative) few times $\log_2(\log_2 N)$ bits for roundoff
in the FFT. A number of length N bytes in radix 256 can generate convolution
components as large as the order of $(256)^2 N$, thus requiring $16 + \log_2 N$ bits of
precision for exact storage. If $\mathit{it}$ is the number of bits in the floating mantissa
(cf. §20.1), we obtain the condition

$$16 + \log_2 N + \text{few} \times \log_2 \log_2 N < \mathit{it} \qquad (20.6.3)$$
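To get a feeling for the sizes involved, here is the left-hand side of (20.6.3) worked out for $N = 10^6$ bytes. The value few = 3 is purely our assumption for illustration; the condition itself does not pin the constant down.

```latex
\begin{align*}
\log_2 N &= \log_2 10^6 \approx 19.93, \qquad
\log_2 \log_2 N \approx \log_2 19.93 \approx 4.32, \\
16 + \log_2 N + 3 \log_2 \log_2 N &\approx 16 + 19.93 + 12.96 \approx 48.9 .
\end{align*}
```

So, under that assumption, a million-byte multiplication wants a mantissa of roughly 49 bits or more.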

We see that single precision, say with $\mathit{it} = 24$, is inadequate for any interesting
value of N, while double precision, say with $\mathit{it} = 53$, allows N to be greater
than $10^6$, corresponding to some millions of decimal digits. The following routine
therefore presumes double precision versions of realft (§12.3) and four1 (§12.2),
here called drealft and dfour1. (These routines are included on the Numerical
Recipes diskettes.)

#include "nrutil.h"
#define RX 256.0

void mpmul(unsigned char w[], unsigned char u[], unsigned char v[], int n,
    int m)
/* Uses Fast Fourier Transform to multiply the unsigned radix 256 integers
   u[1..n] and v[1..m], yielding a product w[1..n+m]. */
{
    void drealft(double data[], unsigned long n, int isign);  /* double version of realft. */
    int j,mn,nn=1;
    double cy,t,*a,*b;

    mn=IMAX(m,n);
    while (nn < mn) nn <<= 1;       /* Find the smallest usable power of two */
    nn <<= 1;                       /* for the transform. */
    a=dvector(1,nn);
    b=dvector(1,nn);
    for (j=1;j<=n;j++)              /* Move U to a double precision floating array. */
        a[j]=(double)u[j];
    for (j=n+1;j<=nn;j++) a[j]=0.0;
    for (j=1;j<=m;j++)              /* Move V to a double precision floating array. */
        b[j]=(double)v[j];
    for (j=m+1;j<=nn;j++) b[j]=0.0;
    drealft(a,nn,1);                /* Perform the convolution: First, the two */
    drealft(b,nn,1);                /* Fourier transforms. */
    b[1] *= a[1];                   /* Then multiply the complex results */
    b[2] *= a[2];                   /* (real and imaginary parts). */
    for (j=3;j<=nn;j+=2) {
        b[j]=(t=b[j])*a[j]-b[j+1]*a[j+1];
        b[j+1]=t*a[j+1]+b[j+1]*a[j];
    }
    drealft(b,nn,-1);               /* Then do the inverse Fourier transform. */
    cy=0.0;                         /* Make a final pass to do all the carries. */
    for (j=nn;j>=1;j--) {
        t=b[j]/(nn>>1)+cy+0.5;      /* The 0.5 allows for roundoff error. */
        cy=(unsigned long) (t/RX);
        b[j]=t-cy*RX;
    }
    if (cy >= RX) nrerror("cannot happen in fftmul");
    w[1]=(unsigned char) cy;        /* Copy answer to output. */
    for (j=2;j<=n+m;j++)
        w[j]=(unsigned char) b[j-1];
    free_dvector(b,1,nn);
    free_dvector(a,1,nn);
}
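The final carry pass in mpmul is worth a moment by itself, since it is where the floating-point convolution turns back into exact digits. Here is a small sketch of just that step, with a hypothetical function name, 0-based arrays, and the normalization by nn/2 assumed to have been done already. Adding 0.5 before truncation rounds each component to the nearest integer, so any roundoff smaller than one half disappears.

```c
#include <assert.h>

/* Turn nearly-integer column sums b[0..len-1] (most significant first),
   such as those left by an inverse FFT, into exact digits in the given
   radix.  Returns the final carry, which is the leading digit of the
   product.  Hypothetical helper for illustration only. */
unsigned long round_and_carry(unsigned char digit[], const double b[],
                              int len, double radix)
{
    double cy = 0.0, t;
    int j;

    assert(len > 0 && radix >= 2.0 && radix <= 256.0);

    for (j = len - 1; j >= 0; j--) {
        t = b[j] + cy + 0.5;                     /* the 0.5 absorbs roundoff */
        cy = (double)(unsigned long)(t / radix); /* carry to the left */
        digit[j] = (unsigned char)(t - cy * radix);
    }
    return (unsigned long)cy;
}
```

Fed the slightly perturbed column sums 27.9999996, 67.0000003, 118.0000001, 92.9999998, 54.0000002 in radix 10, it returns the leading digit 3 and stores 5, 9, 7, 8, 4, recovering 456 × 789 = 359784.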



With multiplication thus a “fast” operation, division is best performed by
multiplying the dividend by the reciprocal of the divisor. The reciprocal of a value
V is calculated by iteration of Newton’s rule,

$$U_{i+1} = U_i (2 - V U_i) \qquad (20.6.4)$$

which results in the quadratic convergence of $U_\infty$ to $1/V$, as you can easily
prove. (Many supercomputers and RISC machines actually use this iteration to
perform divisions.) We can now see where the operations count N log N log log N,
mentioned above, originates: N log N is in the Fourier transform, with the iteration
to converge Newton’s rule giving an additional factor of log log N.

#include "nrutil.h"
#define MF 4
#define BI (1.0/256)

void mpinv(unsigned char u[], unsigned char v[], int n, int m)
/* Character string v[1..m] is interpreted as a radix 256 number with the
   radix point after (nonzero) v[1]; u[1..n] is set to the most significant
   digits of its reciprocal, with the radix point after u[1]. */
{
    void mpmov(unsigned char u[], unsigned char v[], int n);
    void mpmul(unsigned char w[], unsigned char u[], unsigned char v[], int n,
        int m);
    void mpneg(unsigned char u[], int n);
    unsigned char *rr,*s;
    int i,j,maxmn,mm;
    float fu,fv;

    maxmn=IMAX(n,m);
    rr=cvector(1,1+(maxmn<<1));
    s=cvector(1,maxmn);
    mm=IMIN(MF,m);
    fv=(float) v[mm];           /* Use ordinary floating arithmetic to get */
    for (j=mm-1;j>=1;j--) {     /* an initial approximation. */
        fv *= BI;
        fv += v[j];
    }
    fu=1.0/fv;
    for (j=1;j<=n;j++) {
        i=(int) fu;
        u[j]=(unsigned char) i;
        fu=256.0*(fu-i);
    }
    for (;;) {                  /* Iterate Newton's rule to convergence. */
        mpmul(rr,u,v,n,m);      /* Construct 2 - UV in S. */
        mpmov(s,&rr[1],n);
        mpneg(s,n);
        s[1] -= 254;
        mpmul(rr,s,u,n,n);      /* Multiply SU into U. */
        mpmov(u,&rr[1],n);
        for (j=2;j<n;j++)       /* If fractional part of S is not zero, */
            if (s[j]) break;    /* it has not converged to 1. */
        if (j==n) {
            free_cvector(s,1,maxmn);
            free_cvector(rr,1,1+(maxmn<<1));
            return;
        }
    }
}
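The quadratic convergence of Newton's rule (20.6.4) is easy to watch in ordinary double precision. The sketch below is ours, not one of the multiple precision routines; the function name and the fixed iteration count are illustrative choices. If $e = 1 - VU$ is the error before a step, then after the step it is $1 - VU(2 - VU) = e^2$, so the number of correct digits doubles each time.

```c
#include <assert.h>

/* Iterate Newton's rule U <- U*(2 - V*U) a fixed number of times from the
   starting guess u0 and return the result.  Converges to 1/V provided
   0 < V*u0 < 2.  Illustrative names; not a library routine. */
double newton_reciprocal(double v, double u0, int iterations)
{
    double u = u0;
    int i;

    assert(v > 0.0 && iterations >= 0);

    for (i = 0; i < iterations; i++)
        u = u * (2.0 - v * u);   /* error 1 - V*U is squared on each pass */
    return u;
}
```

Starting from the guess 0.1 for 1/7, the error 1 - 7U goes 0.3, 0.09, 0.0081, and so on, reaching the limit of double precision in about six steps. Notice that no division is performed anywhere, which is the whole point.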


Division now follows as a simple corollary, with only the necessity of calculating
the reciprocal to sufficient accuracy to get an exact quotient and remainder.

#include "nrutil.h"
#define MACC 6

void mpdiv(unsigned char q[], unsigned char r[], unsigned char u[],
    unsigned char v[], int n, int m)
/* Divides unsigned radix 256 integers u[1..n] by v[1..m] (with m <= n
   required), yielding a quotient q[1..n-m+1] and a remainder r[1..m]. */
{
    void mpinv(unsigned char u[], unsigned char v[], int n, int m);
    void mpmov(unsigned char u[], unsigned char v[], int n);
    void mpmul(unsigned char w[], unsigned char u[], unsigned char v[], int n,
        int m);
    void mpsad(unsigned char w[], unsigned char u[], int n, int iv);
    void mpsub(int *is, unsigned char w[], unsigned char u[], unsigned char v[],
        int n);
    int is;
    unsigned char *rr,*s;

    rr=cvector(1,(n+MACC)<<1);
    s=cvector(1,(n+MACC)<<1);
    mpinv(s,v,n+MACC,m);            /* Set S = 1/V. */
    mpmul(rr,s,u,n+MACC,n);         /* Set Q = SU. */
    mpsad(s,rr,n+MACC-1,1);
    mpmov(q,&s[2],n-m+1);
    mpmul(rr,q,v,n-m+1,m);          /* Multiply and subtract to get the remainder. */
    mpsub(&is,&rr[1],u,&rr[1],n);
    if (is) nrerror("MACC too small in mpdiv");
    mpmov(r,&rr[n-m+1],m);
    free_cvector(s,1,(n+MACC)<<1);
    free_cvector(rr,1,(n+MACC)<<1);
}
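The "multiply and subtract" idea can be seen in miniature with machine integers. What follows is a sketch under our own assumptions, not a version of mpdiv: the name is hypothetical, the reciprocal is passed in as an ordinary double (from Newton's rule, say), and where mpdiv relies on MACC guard bytes and reports an error if they prove too few, this sketch simply nudges the trial quotient until the remainder lands in range.

```c
#include <assert.h>
#include <stdint.h>

/* Divide u by v using an approximate reciprocal recip ~ 1/v: form a trial
   quotient, multiply back and subtract to get the remainder, and correct
   the quotient if the reciprocal was not quite accurate enough.
   Illustrative only; this is not the book's mpdiv. */
void divide_by_reciprocal(uint32_t u, uint32_t v, double recip,
                          uint32_t *q, uint32_t *r)
{
    int64_t quot, rem;

    assert(v != 0 && recip >= 0.0);

    quot = (int64_t)((double)u * recip);       /* trial quotient Q = S*U */
    rem = (int64_t)u - quot * (int64_t)v;      /* multiply and subtract */

    while (rem < 0) { quot--; rem += v; }             /* Q was too large */
    while (rem >= (int64_t)v) { quot++; rem -= v; }   /* Q was too small */

    *q = (uint32_t)quot;
    *r = (uint32_t)rem;
}
```

With u = 1000000, v = 7, and the six-figure reciprocal 0.142857, the result is quotient 142857 and remainder 1. A cruder reciprocal gives the same answer, only with more correction steps, which is the small-scale analog of needing the reciprocal "to sufficient accuracy."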


Square roots are calculated by a Newton’s rule much like division. If

$$U_{i+1} = \tfrac{1}{2} U_i (3 - V U_i^2) \qquad (20.6.5)$$

then $U_\infty$ converges quadratically to $1/\sqrt{V}$. A final multiplication by V gives $\sqrt{V}$.
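Before turning to the multiple precision version, you can try (20.6.5) in plain double precision. This is only a sketch with names of our own choosing; like the reciprocal iteration, it uses no division, and the square root itself comes from the final multiplication by V.

```c
#include <assert.h>

/* Iterate U <- U*(3 - V*U*U)/2, which converges to 1/sqrt(V) from a
   reasonable starting guess u0, then multiply by V to obtain sqrt(V).
   Illustrative names; the iteration count is fixed by the caller. */
double newton_sqrt(double v, double u0, int iterations)
{
    double u = u0;
    int i;

    assert(v > 0.0 && u0 > 0.0 && iterations >= 0);

    for (i = 0; i < iterations; i++)
        u = 0.5 * u * (3.0 - v * u * u);
    return v * u;                /* V * (1/sqrt(V)) = sqrt(V) */
}
```

For V = 2 and the starting guess 0.7, half a dozen iterations give the square root of 2 to essentially full double precision.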

#include <math.h>
#include "nrutil.h"
#define MF 3
#define BI (1.0/256)

void mpsqrt(unsigned char w[], unsigned char u[], unsigned char v[], int n,
    int m)
/* Character string v[1..m] is interpreted as a radix 256 number with the
   radix point after v[1]; w[1..n] is set to its square root (radix point
   after w[1]), and u[1..n] is set to the reciprocal thereof (radix point
   before u[1]). w and u need not be distinct, in which case they are set
   to the square root. */
{
    void mplsh(unsigned char u[], int n);
    void mpmov(unsigned char u[], unsigned char v[], int n);
    void mpmul(unsigned char w[], unsigned char u[], unsigned char v[], int n,
        int m);
    void mpneg(unsigned char u[], int n);
    void mpsdv(unsigned char w[], unsigned char u[], int n, int iv, int *ir);
    int i,ir,j,mm;
    float fu,fv;
    unsigned char *r,*s;

    r=cvector(1,n<<1);
    s=cvector(1,n<<1);
    mm=IMIN(m,MF);
    fv=(float) v[mm];               /* Use ordinary floating arithmetic to get */
    for (j=mm-1;j>=1;j--) {         /* an initial approximation. */
        fv *= BI;
        fv += v[j];
    }
    fu=1.0/sqrt(fv);
    for (j=1;j<=n;j++) {
        i=(int) fu;
        u[j]=(unsigned char) i;
        fu=256.0*(fu-i);
    }
    for (;;) {                      /* Iterate Newton's rule to convergence. */
        mpmul(r,u,u,n,n);           /* Construct S = (3 - V*U^2)/2. */
        mplsh(r,n);
        mpmul(s,r,v,n,IMIN(m,n));
        mplsh(s,n);
        mpneg(s,n);
        s[1] -= 253;
        mpsdv(s,s,n,2,&ir);
        for (j=2;j<n;j++) {         /* If fractional part of S is not zero, */
            if (s[j]) {             /* it has not converged to 1. */
                mpmul(r,s,u,n,n);   /* Replace U by SU. */
                mpmov(u,&r[1],n);
                break;
            }
        }
        if (j<n) continue;
        mpmul(r,u,v,n,IMIN(m,n));   /* Get square root from reciprocal and return. */
        mpmov(w,&r[1],n);
        free_cvector(s,1,n<<1);
        free_cvector(r,1,n<<1);
        return;
    }
}


We already mentioned that radix conversion to decimal is a merely cosmetic
operation that should normally be omitted. The simplest way to convert a fraction to
decimal is to multiply it repeatedly by 10, picking off (and subtracting) the resulting
integer part. This has an operations count of $O(N^2)$, however, since each liberated
decimal digit takes an $O(N)$ operation. It is possible to do the radix conversion as
a fast operation by a “divide and conquer” strategy, in which the fraction is (fast)
multiplied by a large power of 10, enough to move about half the desired digits
to the left of the radix point. The integer and fractional pieces are now processed
independently, each further subdivided. If our goal were a few billion digits of π,
instead of a few thousand, we would need to implement this scheme. For present
purposes, the following lazy routine is adequate:


#define IAZ 48

void mp2dfr(unsigned char a[], unsigned char s[], int n, int *m)
/* Converts a radix 256 fraction a[1..n] (radix point before a[1]) to a
   decimal fraction represented as an ascii string s[1..m], where m is a
   returned value. The input array a[1..n] is destroyed. NOTE: For
   simplicity, this routine implements a slow (proportional to N^2)
   algorithm. Fast (proportional to N ln N), more complicated, radix
   conversion algorithms do exist. */
{
    void mplsh(unsigned char u[], int n);
    void mpsmu(unsigned char w[], unsigned char u[], int n, int iv);
    int j;

    *m=(int) (2.408*n);
    for (j=1;j<=(*m);j++) {
        mpsmu(a,a,n,10);
        s[j]=a[1]+IAZ;
        mplsh(a,n);
    }
}
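If you want to experiment with the multiply-by-10 conversion without the rest of the library, here is a self-contained variant. It is a sketch of the same slow method, written by us with 0-based arrays, the short multiplication done inline, and the number of digits chosen by the caller; none of those choices come from mp2dfr itself.

```c
#include <assert.h>

/* Convert the radix-256 fraction a[0..n-1] (radix point before a[0]) to
   ndigits decimal digits, written as ASCII into s and NUL-terminated
   (so s needs room for ndigits+1 characters).  Each pass multiplies the
   whole fraction by 10; the carry out of the top byte is the next decimal
   digit.  The input a is destroyed.  Illustrative 0-based variant. */
void fraction_to_decimal(unsigned char a[], int n, char s[], int ndigits)
{
    int d, j;

    assert(n > 0 && ndigits >= 0);

    for (d = 0; d < ndigits; d++) {
        unsigned int carry = 0;
        for (j = n - 1; j >= 0; j--) {       /* short multiplication by 10 */
            unsigned int t = 10u * a[j] + carry;
            a[j] = (unsigned char)(t & 0xff);
            carry = t >> 8;
        }
        s[d] = (char)('0' + carry);          /* integer part, always 0..9 */
    }
    s[ndigits] = '\0';
}
```

The single byte 0x40 represents 64/256 = 0.25, and two passes yield the string "25". Each byte carries $\log_{10} 256 \approx 2.408$ decimal digits, which is the origin of the factor 2.408 in mp2dfr.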



Finally, then, we arrive at a routine implementing equations (20.6.1) and (20.6.2):

3.1415926535897932384626433832795028841971693993751058209749445923078164062
862089986280348253421170679821480865132823066470938446095505822317253594081 
284811174502841027019385211055596446229489549303819644288109756659334461284 
756482337867831652712019091456485669234603486104543266482133936072602491412 
737245870066063155881748815209209628292540917153643678925903600113305305488 
204665213841469519415116094330572703657595919530921861173819326117931051185 
480744623799627495673518857527248912279381830119491298336733624406566430860 
213949463952247371907021798609437027705392171762931767523846748184676694051 
320005681271452635608277857713427577896091736371787214684409012249534301465 
495853710507922796892589235420199561121290219608640344181598136297747713099 
605187072113499999983729780499510597317328160963185950244594553469083026425 
223082533446850352619311881710100031378387528865875332083814206171776691473 
035982534904287554687311595628638823537875937519577818577805321712268066130 
019278766111959092164201989380952572010654858632788659361533818279682303019 
520353018529689957736225994138912497217752834791315155748572424541506959508 
295331168617278558890750983817546374649393192550604009277016711390098488240 
128583616035637076601047101819429555961989467678374494482553797747268471040 
475346462080466842590694912933136770289891521047521620569660240580381501935 
112533824300355876402474964732639141992726042699227967823547816360093417216 
412199245863150302861829745557067498385054945885869269956909272107975093029 
553211653449872027559602364806654991198818347977535663698074265425278625518 
184175746728909777727938000816470600161452491921732172147723501414419735685 
481613611573525521334757418494684385233239073941433345477624168625189835694 
855620992192221842725502542568876717904946016534668049886272327917860857843 
838279679766814541009538837863609506800642251252051173929848960841284886269 
456042419652850222106611863067442786220391949450471237137869609563643719172 
874677646575739624138908658326459958133904780275900994657640789512694683983 
525957098258226205224894077267194782684826014769909026401363944374553050682 
034962524517493996514314298091906592509372216964615157098583874105978859597 
729754989301617539284681382686838689427741559918559252459539594310499725246 
808459872736446958486538367362226260991246080512438843904512441365497627807 
977156914359977001296160894416948685558484063534220722258284886481584560285 
Figure 20.6.1. The first 2398 decimal digits of π, computed by the routines in this section.


#include <stdio.h>
#include "nrutil.h"
#define IAOFF 48

void mppi(int n)
/* Demonstrate multiple precision routines by calculating and printing the
   first n bytes of pi. */
{
    void mp2dfr(unsigned char a[], unsigned char s[], int n, int *m);
    void mpadd(unsigned char w[], unsigned char u[], unsigned char v[], int n);
    void mpinv(unsigned char u[], unsigned char v[], int n, int m);
    void mplsh(unsigned char u[], int n);
    void mpmov(unsigned char u[], unsigned char v[], int n);
    void mpmul(unsigned char w[], unsigned char u[], unsigned char v[], int n,
        int m);
    void mpsdv(unsigned char w[], unsigned char u[], int n, int iv, int *ir);
    void mpsqrt(unsigned char w[], unsigned char u[], unsigned char v[], int n,
        int m);
    int ir,j,m;
    unsigned char mm,*x,*y,*sx,*sxi,*t,*s,*pi;

    x=cvector(1,n+1);
    y=cvector(1,n<<1);
    sx=cvector(1,n);
    sxi=cvector(1,n);
    t=cvector(1,n<<1);
    s=cvector(1,3*n);
    pi=cvector(1,n+1);
    t[1]=2;                             /* Set T = 2. */
    for (j=2;j<=n;j++) t[j]=0;
    mpsqrt(x,x,t,n,n);                  /* Set X_0 = sqrt(2). */
    mpadd(pi,t,x,n);                    /* Set pi_0 = 2 + sqrt(2). */
    mplsh(pi,n);
    mpsqrt(sx,sxi,x,n,n);               /* Set Y_0 = 2^(1/4). */
    mpmov(y,sx,n);
    for (;;) {
        mpadd(x,sx,sxi,n);              /* Set X_{i+1} = (X_i^(1/2) + X_i^(-1/2))/2. */
        mpsdv(x,&x[1],n,2,&ir);
        mpsqrt(sx,sxi,x,n,n);           /* Form the temporary */
        mpmul(t,y,sx,n,n);              /* T = Y_i X_{i+1}^(1/2) + X_{i+1}^(-1/2). */
        mpadd(&t[1],&t[1],sxi,n);
        x[1]++;                         /* Increment X_{i+1} and Y_i by 1. */
        y[1]++;
        mpinv(s,y,n,n);                 /* Set Y_{i+1} = T/(Y_i + 1). */
        mpmul(y,&t[2],s,n,n);
        mplsh(y,n);
        mpmul(t,x,s,n,n);               /* Form temporary T = (X_{i+1} + 1)/(Y_i + 1). */
        mm=t[2]-1;                      /* If T = 1 then we have converged. */
        for (j=3;j<=n;j++) {
            if (t[j] != mm) break;
        }
        m=t[n+1]-mm;
        if (j <= n || m > 1 || m < -1) {
            mpmul(s,pi,&t[1],n,n);      /* Set pi_{i+1} = T pi_i. */
            mpmov(pi,&s[1],n);
            continue;
        }
        printf("pi=\n");
        s[1]=pi[1]+IAOFF;
        s[2]='.';
        mp2dfr(&pi[1],&s[2],n-1,&m);
        /* Convert to decimal for printing. NOTE: The conversion routine, for
           this demonstration only, is a slow (proportional to N^2) algorithm.
           Fast (proportional to N ln N), more complicated, radix conversion
           algorithms do exist. */
        s[m+3]=0;
        printf(" %64s\n",&s[1]);
        free_cvector(pi,1,n+1);
        free_cvector(s,1,3*n);
        free_cvector(t,1,n<<1);
        free_cvector(sxi,1,n);
        free_cvector(sx,1,n);
        free_cvector(y,1,n<<1);
        free_cvector(x,1,n+1);
        return;
    }
}
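The iteration that mppi carries out can also be followed in ordinary double precision, where a handful of steps exhausts the available digits. The sketch below is ours: the function names are hypothetical, the recurrences are transcribed from the comments in mppi, and the little Newton square root is there only so that the example does not depend on the math library.

```c
#include <assert.h>

/* Square root by Newton's method (x <- (x + v/x)/2), so that this sketch
   needs no math library.  Sixty passes is far more than double needs. */
static double sqrt_newton(double v)
{
    double x = v > 1.0 ? v : 1.0;
    int i;

    assert(v > 0.0);
    for (i = 0; i < 60; i++) x = 0.5 * (x + v / x);
    return x;
}

/* Run the quadratically convergent iteration for pi in double precision,
   using the same recurrences as the comments in mppi. */
double pi_by_iteration(int iterations)
{
    double x = sqrt_newton(2.0);        /* X_0 = sqrt(2) */
    double pi = 2.0 + x;                /* pi_0 = 2 + sqrt(2) */
    double y = sqrt_newton(x);          /* Y_0 = 2^(1/4) */
    int i;

    for (i = 0; i < iterations; i++) {
        double sx = sqrt_newton(x);
        double xn = 0.5 * (sx + 1.0 / sx);     /* X_{i+1} */
        double sxn = sqrt_newton(xn);
        double t = y * sxn + 1.0 / sxn;        /* temporary T */
        pi *= (xn + 1.0) / (y + 1.0);          /* pi_{i+1}, using the old Y_i */
        y = t / (y + 1.0);                     /* Y_{i+1} */
        x = xn;
    }
    return pi;
}
```

The successive values approach π from above, and after about four iterations they agree with π to the precision of a double. In mppi the same few extra iterations are what carry the result out to thousands of digits.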


Figure 20.6.1 gives the result, computed with n = 1000. As an exercise, you
might enjoy checking the first hundred digits of the figure against the first 12 terms
of Ramanujan’s celebrated identity [3]

$$\frac{1}{\pi} = \frac{\sqrt{8}}{9801} \sum_{n=0}^{\infty} \frac{(4n)!\,(1103 + 26390n)}{(n!\;396^n)^4} \qquad (20.6.6)$$

using the above routines. You might also use the routines to verify that the
number $2^{512} + 1$ is not a prime, but has factors 2,424,833 and
7,455,602,825,647,884,208,337,395,736,200,454,918,783,366,342,657 (which are
in fact prime; the remaining prime factor being about $7.416 \times 10^{98}$) [4].
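The smaller of the two factors can be confirmed with no multiple precision arithmetic at all, which makes a handy cross-check on whatever you get from the routines. The number 2,424,833 divides $2^{512} + 1$ exactly when $2^{512}$ leaves remainder 2,424,832 (that is, −1) on division by it, and $2^{512}$ is just 2 squared nine times. The helper below is our own sketch, not part of this section's library, and it says nothing about the larger factors.

```c
#include <assert.h>
#include <stdint.h>

/* Return 2^(2^k) mod p by squaring 2 a total of k times, reducing mod p
   after each squaring.  Requires 1 < p < 2^32 so that every intermediate
   square fits in 64 bits.  Hypothetical helper for illustration. */
uint64_t pow2_pow2_mod(unsigned k, uint64_t p)
{
    uint64_t r;
    unsigned i;

    assert(p > 1 && p < ((uint64_t)1 << 32));

    r = 2 % p;
    for (i = 0; i < k; i++)
        r = (r * r) % p;
    return r;
}
```

Calling pow2_pow2_mod(9, 2424833) gives 2424832, confirming the divisibility. The same function with k = 5 and p = 641 gives 640, which is Euler's classic observation that 641 divides $2^{32} + 1$.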